Learning Cover Context-Free Grammars from Structural Data

Mircea MARIN, Gabriel ISTRATE


Abstract 

We consider the problem of learning an unknown context-free grammar from its structural descriptions with depth at most ℓ. The structural descriptions of the context-free grammar are its unlabelled derivation trees. The goal is to learn a cover context-free grammar (CCFG) with respect to ℓ, that is, a CFG whose structural descriptions with depth at most ℓ agree with those of the unknown CFG. We propose an algorithm, called LA^ℓ, that efficiently learns a CCFG using two types of queries: structural equivalence and structural membership. The learning protocol is based on what is called in the literature a “minimally adequate teacher.” We show that LA^ℓ runs in time polynomial in the number of states of a minimal deterministic finite cover tree automaton (DCTA) with respect to ℓ. This number is often much smaller than the number of states of a minimum deterministic finite tree automaton for the structural descriptions of the unknown grammar.


Keywords: automata theory and formal languages, grammatical 
inference, structural descriptions 


1 Introduction 


Angluin’s approach to grammatical inference [1] is an important contribution to computational learning, with extensions to problems, such as compositional verification and synthesis [6, 11], that go beyond the usual applications to natural language processing and computational biology [5]. Practical concerns, e.g. [9], seem to require going beyond regular languages to classes of languages with regular tree nature. However, Angluin and Kharitonov have shown that learning context-free grammars (CFGs) from membership and equivalence queries is intractable under plausible cryptographic assumptions [2]. A way out is to learn structural descriptions of CFGs, that is, trees obtained from the derivation trees of the grammar by unlabelling all their internal nodes. Sakakibara has shown that Angluin’s algorithm L* extends to this setting [12], and proposed a learning algorithm LA that runs in time polynomial in the number of states of a minimal deterministic bottom-up tree automaton for the structural descriptions of the unknown grammar and the maximum size of any counterexample returned by a structural equivalence query. His approach has applications in learning the structural descriptions of natural languages, which describe the shape of the parse trees of well-chosen CFGs.


Often, these structural descriptions are subject to additional restrictions arising from modelling considerations. For instance, in natural language understanding, the bounded memory restriction on human comprehension seems to limit the recursion depth of such a parse tree to a constant. A natural example with a similar flavour is the limitation imposed by the LaTeX system, which limits the number of nestings of itemised environments to a small constant. For such applications, a reasonable requirement is to restrict our interest to structural descriptions whose depth is bounded by a constant, say ℓ, and to learn a deterministic tree automaton 𝒜 which recognises all structural descriptions of depth at most ℓ; for structural descriptions of larger depth, the behaviour of 𝒜 is irrelevant. We call such a tree automaton a deterministic cover tree automaton (DCTA) for depth ℓ. If, instead of structural descriptions of depth at most ℓ, we consider learning a set of strings with length at most ℓ, the problem boils down to learning a minimal cover automaton (DCA) for them. Minimal cover automata were first discussed by Câmpeanu et al. in [4], and an efficient algorithm capable of learning a minimal DCA for finite sets of words with length at most ℓ was described by Ipate in [8]. Ipate’s algorithm, called L^ℓ, learns such an automaton in time polynomial in the number of its states.


In this paper, we extend Ipate’s approach to the learning of structural descriptions of CFGs up to a constant depth ℓ. We propose an algorithm called LA^ℓ which asks two types of queries: structural equivalence and structural membership queries, both restricted to structural descriptions with depth at most ℓ, where ℓ is a constant. LA^ℓ stores the answers retrieved from the teacher in an observation table which is used to guide the learning protocol and to construct a minimal DCTA of the unknown context-free grammar with respect to ℓ. Our main result shows that LA^ℓ runs in time polynomial in n and m, where n is the number of states of a minimal DCTA of the unknown CFG with respect to ℓ, and m is the maximum size of a counterexample returned by a failed structural equivalence query.

The paper is structured as follows. Section 2 introduces the basic notions and results to be used later in the paper, and Sect. 3 recalls Sakakibara’s algorithm LA. In Sect. 4 we introduce the main concepts related to the specification and analysis of our learning algorithm LA^ℓ. They are natural generalisations to languages of structural descriptions of the concepts proposed by Ipate [8] in the design and study of his algorithm L^ℓ. In Sect. 5 we analyse the space and time complexity of LA^ℓ and show that its time complexity is a polynomial in n and m, where n is the number of states of a minimal deterministic finite cover automaton w.r.t. ℓ of the language of structural descriptions of interest, and m is an upper bound on the size of the counterexamples returned by failed structural equivalence queries.


2 Preliminaries 


We write N for the set of nonnegative integers, A* for the set of finite strings over a set A, and ε for the empty string. If v, w ∈ A*, we write v ≤ w if there exists w' ∈ A* such that vw' = w; v < w if v ≤ w and v ≠ w; and v ⊥ w if neither v ≤ w nor w ≤ v.


Trees, Terms, Contexts, and Context-free Grammars 


A ranked alphabet is a finite set F of function symbols together with a finite rank relation rk(F) ⊆ F × N. We denote the subset {f ∈ F | (f,m) ∈ rk(F)} by F_m, the set {m | (f,m) ∈ rk(F)} by ar(f), and ⋃_{f∈F} ar(f) by ar(F). The terms of the set 𝒯(F) are the strings of symbols defined recursively by the grammar t ::= a | f(t_1,...,t_m) where a ∈ F_0 and f ∈ F_m with m > 0. The yield of a term t ∈ 𝒯(F) is the finite string yield(t) ∈ F_0* defined as follows: yield(a) := a if a ∈ F_0, and yield(f(t_1,...,t_m)) := w_1...w_m where w_i = yield(t_i) for 1 ≤ i ≤ m.

A finite ordered tree over a set of labels F is a mapping t from a nonempty, finite, and prefix-closed set Pos(t) ⊆ (N \ {0})* into F. Each element in Pos(t) is called a position. The tree t is ranked if F is a ranked alphabet, and t satisfies the following additional property: for all p ∈ Pos(t), there exists m ∈ N such that {i ∈ N | pi ∈ Pos(t)} = {1,...,m} and t(p) ∈ F_m.

Thus, any term t ∈ 𝒯(F) may be viewed as a finite ordered ranked tree, and we will refer to it by “tree” when we mean the finite ordered tree with the additional property mentioned above. The depth of t is d(t) := max{‖p‖ | p ∈ Pos(t)} where ‖p‖ denotes the length of p as a sequence of numbers. The size sz(t) of t is the number of elements of the set {p ∈ Pos(t) | ‖p‖ ≠ d(t)}, that is, the number of internal nodes of t.
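As an illustration of the definitions of yield and depth, the following Python sketch encodes a term f(t_1,...,t_m) as the tuple ("f", t_1, ..., t_m) and a constant as a plain string. The encoding and the names `term_yield` and `depth` are assumptions made for this example only; they do not come from the paper.

```python
# Terms as nested tuples: ("f", t1, ..., tm) stands for f(t1,...,tm),
# and a plain string stands for a constant. This encoding is an
# assumption of the sketch, not notation from the paper.

def term_yield(t):
    """Return yield(t): the constants of t read from left to right."""
    if isinstance(t, str):
        return [t]
    result = []
    for child in t[1:]:
        result.extend(term_yield(child))
    return result


def depth(t):
    """Return d(t): the length of the longest position of t."""
    if isinstance(t, str):
        return 0
    return 1 + max(depth(child) for child in t[1:])


example = ("f", "a", ("g", "b", "c"))
print(term_yield(example))  # ['a', 'b', 'c']
print(depth(example))       # 2
```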

The subterm t|_p of a term t at position p ∈ Pos(t) is defined by the following: Pos(t|_p) := {i | pi ∈ Pos(t)}, and t|_p(p') := t(pp') for all p' ∈ Pos(t|_p). We denote by t[u]_p the term obtained by replacing in t the subterm t|_p with u, that is: Pos(t[u]_p) = (Pos(t) − {pp' | p' ∈ Pos(t|_p)}) ∪ {pp'' | p'' ∈ Pos(u)}, and

\[ t[u]_p(p') = \begin{cases} u(p'') & \text{if } p' = pp'' \text{ with } p'' \in Pos(u),\\ t(p') & \text{otherwise.} \end{cases} \]


The set 𝒞(F) of contexts over F is the set of terms over F ∪ {•}, where:
- • is a distinguished fresh symbol with ar(•) = {0}, called hole,
- rk(F ∪ {•}) = rk(F) ∪ {(•,0)}, and
- every element C ∈ 𝒞(F) contains only one occurrence of •. This is the same as saying that {p ∈ Pos(C) | C|_p = •} is a singleton set.

If C ∈ 𝒞(F) and u ∈ 𝒞(F) ∪ 𝒯(F) then C[u] stands for the context or term C[u]_p, where C|_p = •. The hole depth of a context C ∈ 𝒞(F) is d_•(C) := ‖p‖ where p is the unique position of C such that C|_p = •. From now on, whenever M is a set of terms, P is a set of contexts, and m is a non-negative integer, we define the sets M_[m] := {t ∈ M | d(t) ≤ m} and P_(m) := {C ∈ P | d_•(C) ≤ m}. Thus, if A is a set of terms and/or contexts, the subscript [m] of A indicates that its elements have depth at most m, and the subscript (m) of A indicates that its elements are contexts with hole depth at most m.
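The operations on positions and contexts can be sketched in Python as follows. The sketch assumes terms are nested tuples ("f", t_1, ..., t_m) with constants as strings, positions are tuples of 1-based child indices, and the hole is the string "•"; all function names are hypothetical and chosen for this illustration.

```python
HOLE = "•"  # the distinguished hole symbol (encoding chosen for this sketch)


def subterm(t, p):
    """Return t|_p, where p is a tuple of 1-based child indices."""
    for i in p:
        t = t[i]  # child i of ("f", t1, ..., tm) sits at tuple index i
    return t


def replace(t, p, u):
    """Return t[u]_p: t with the subterm at position p replaced by u."""
    if not p:
        return u
    i = p[0]
    return t[:i] + (replace(t[i], p[1:], u),) + t[i + 1:]


def hole_position(c):
    """Return the position of the hole in c, or None if c has no hole."""
    if c == HOLE:
        return ()
    if isinstance(c, str):
        return None
    for i, child in enumerate(c[1:], start=1):
        p = hole_position(child)
        if p is not None:
            return (i,) + p
    return None


def plug(c, u):
    """Return C[u]: the context c with its hole replaced by u."""
    return replace(c, hole_position(c), u)


def hole_depth(c):
    """Return the hole depth of c, i.e. the length of the hole position."""
    return len(hole_position(c))


context = ("σ", HOLE, "b")
filled = plug(context, ("σ", "a"))
print(filled)               # ('σ', ('σ', 'a'), 'b')
print(hole_depth(context))  # 1
```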

We assume that the reader is acquainted with the notions of CFG and the context-free language L(G) generated by a CFG G, see, e.g., [13]. A CFG is ε-free if it has no productions of the form X → ε. It is well known [7] that every ε-free context-free language L (that is, ε ∉ L) is generated by an ε-free CFG. The derivation trees of an ε-free CFG G = (N, Σ, P, S) correspond to terms from 𝒯(N ∪ Σ) with ar(a) = {0} for all a ∈ Σ and ar(X) = {m | ∃(X → α) ∈ P with ‖α‖ = m} for all X ∈ N. The sets D_G(U) of derivation trees issued from U ∈ N ∪ Σ and D(G) of derivation trees of G are defined recursively as follows:

\[ D_G(a) := \{a\} \quad \text{if } a \in \Sigma, \]
\[ D_G(X) := \bigcup_{(X \to U_1 \ldots U_m) \in P} \{ X(t_1,\ldots,t_m) \mid t_1 \in D_G(U_1) \wedge \ldots \wedge t_m \in D_G(U_m) \}, \]

and D(G) := D_G(S). Note that L(G) = {yield(t) | t ∈ D(G)}.


Structural Descriptions and Cover Context-free Grammars 


A skeletal alphabet is a ranked alphabet Sk = {σ}, where σ is a special symbol with ar(σ) a finite subset of N \ {0}, and a skeletal set is a ranked alphabet Sk ∪ A where Sk ∩ A = ∅ and ar(a) = {0} for all a ∈ A. Skeletal alphabets are intended to describe the structures of the derivation trees of ε-free CFGs. For an ε-free CFG G = (N, Σ, P, S) we consider the skeletal alphabet Sk with ar(σ) := {‖α‖ | (X → α) ∈ P}, and the skeletal set Sk ∪ Σ. The skeletal (or structural) description of a derivation tree t ∈ D_G(U) is the term sk(t) ∈ 𝒯(Sk ∪ Σ) where

\[ sk(t) := \begin{cases} a & \text{if } t = a \in \Sigma,\\ \sigma(sk(t_1),\ldots,sk(t_m)) & \text{if } t = X(t_1,\ldots,t_m) \text{ with } m > 0. \end{cases} \]

For example, if G is the grammar ({S,A}, {a,b}, {S → A, A → aAb, A → ab}, S) then t = S(A(a, A(a,b), b)) ∈ D_G(S) and sk(t) = σ(σ(a, σ(a,b), b)) ∈ 𝒯({σ,a,b}), where ar(σ) = {1,2,3} and ar(a) = ar(b) = {0}.


If M is a set of ranked trees, then the set of its structural descriptions is K(M) := {sk(t) | t ∈ M}. Two context-free grammars G_1 and G_2 over the same alphabet of terminals are structurally equivalent if K(D(G_1)) = K(D(G_2)).
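The skeletal description of the example derivation tree can be reproduced with a few lines of Python. This is a minimal sketch under the assumption that derivation trees are nested tuples ("X", t_1, ..., t_m) and terminals are plain strings; the function name `sk` mirrors the paper's notation, while the encoding is ours.

```python
def sk(t, sigma="σ"):
    """Skeletal description: relabel every internal node with sigma."""
    if isinstance(t, str):  # a terminal symbol a ∈ Σ is left unchanged
        return t
    return (sigma,) + tuple(sk(child, sigma) for child in t[1:])


# The derivation tree t = S(A(a, A(a, b), b)) of the example grammar.
t = ("S", ("A", "a", ("A", "a", "b"), "b"))
print(sk(t))  # ('σ', ('σ', 'a', ('σ', 'a', 'b'), 'b'))
```

Because the grammar is ε-free, every nonterminal node has at least one child, so nonterminals only ever occur as the head of a tuple and cannot be confused with terminals.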


Definition 1 (cover CFG). Let ℓ be a positive integer and G_U be an ε-free CFG of a language U ⊆ Σ*. A cover context-free grammar of G_U with respect to ℓ is an ε-free context-free grammar G' such that K(D(G'))_[ℓ] = K(D(G_U))_[ℓ].


Tree Automata 


The definition of a tree automaton presented here is equivalent to that given in [12]. It is non-standard in the sense that it cannot accept any tree of depth 0.

Definition 2. A nondeterministic (bottom-up) finite tree automaton (NFTA) over F is a quadruple 𝒜 = (Q, F, Q_f, Δ) where Q is a finite set of states, Q_f ⊆ Q is the set of final states, and Δ is a set of transition rules of the form f(q_1,...,q_m) → q where m ≥ 1, f ∈ F_m, q_1,...,q_m ∈ F_0 ∪ Q, and q ∈ Q.


Such an automaton 𝒜 induces a move relation →_𝒜 on the set of terms 𝒯(F ∪ Q) where ar(q) = {0} for all q ∈ Q, as follows:

t →_𝒜 t' if there exist C ∈ 𝒞(F ∪ Q) and f(q_1,...,q_m) → q ∈ Δ such that t = C[f(q_1,...,q_m)] and t' = C[q].

The language accepted by 𝒜 is L(𝒜) := {t ∈ 𝒯(F) | t →*_𝒜 q for some q ∈ Q_f} where →*_𝒜 is the reflexive-transitive closure of →_𝒜. In this paper, a regular tree language is a language accepted by such an NFTA. Two NFTAs are equivalent if they accept the same language.

𝒜 = (Q, F, Q_f, Δ) is deterministic (DFTA) if the transition rules of 𝒜 describe a mapping δ which assigns to every m ∈ ar(F) \ {0} a function δ_m : F_m → (F_0 ∪ Q)^m → Q such that f(q_1,...,q_m) → q ∈ Δ if and only if δ_m(f)(q_1,...,q_m) = q. The extension δ* of {δ_m | m ∈ ar(F) \ {0}} to 𝒯(F) is defined as expected: δ*(f(t_1,...,t_m)) := δ_m(f)(δ*(t_1),...,δ*(t_m)) if m > 0, and δ*(a) = a if a ∈ F_0. Note that L(𝒜) = {t ∈ 𝒯(F) | δ*(t) ∈ Q_f}.
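The bottom-up evaluation δ* can be sketched in Python as shown here. The sketch assumes terms are nested tuples with constants as strings, stores δ as a dictionary keyed by (symbol, argument tuple), and sends every missing entry to an explicit `SINK` state so that δ is total; the dictionary layout, the sink state, and the example automaton are all assumptions of this illustration.

```python
SINK = "sink"  # hypothetical non-final state used for undefined transitions


def run(delta, t):
    """Compute δ*(t) bottom-up; constants are mapped to themselves."""
    if isinstance(t, str):
        return t
    args = tuple(run(delta, child) for child in t[1:])
    return delta.get((t[0], args), SINK)


def accepts(delta, final_states, t):
    """t is accepted iff δ*(t) is a final state."""
    return run(delta, t) in final_states


# A small DFTA over the skeletal set {σ} ∪ {a, b} with one useful state q:
# σ(a, b) → q and σ(a, q, b) → q.
delta = {
    ("σ", ("a", "b")): "q",
    ("σ", ("a", "q", "b")): "q",
}
final_states = {"q"}

print(accepts(delta, final_states, ("σ", "a", ("σ", "a", "b"), "b")))  # True
print(accepts(delta, final_states, ("σ", "b", "a")))                   # False
```

Since constants evaluate to themselves and are never final states, a tree of depth 0 is never accepted, in line with the remark preceding Definition 2.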

Two DFTAs 𝒜_1 = (Q, F, Q_f, Δ) and 𝒜_2 = (Q', F, Q'_f, Δ') are isomorphic if there exists a bijection φ : Q ∪ F_0 → Q' ∪ F_0 such that φ(Q_f) = Q'_f, φ(a) = a for all a ∈ F_0, and for every f ∈ F_m (m > 0) and q_1,...,q_m ∈ F_0 ∪ Q, φ(δ_m(f)(q_1,...,q_m)) = δ'_m(f)(φ(q_1),...,φ(q_m)). A minimal DFTA of a regular tree language L ⊆ 𝒯(F) \ F_0 is a DFTA 𝒜 with the minimal number of states such that L(𝒜) = L.

There is a strong correspondence between tree automata and ε-free CFGs. The NFTA corresponding to an ε-free CFG G = (N, Σ, P, S) is NA(G) = (N, Sk ∪ Σ, {S}, Δ) with Δ := {σ(U_1,...,U_m) → X | (X → U_1...U_m) ∈ P}. Conversely, the ε-free CFG corresponding to an NFTA 𝒜 = (Q, Sk ∪ Σ, Q_f, Δ) over the skeletal set Sk ∪ Σ is G(𝒜) = (Q ∪ {S}, Σ, P, S) where S is a fresh symbol and P := {q → q_1...q_m | (σ(q_1,...,q_m) → q) ∈ Δ} ∪ {S → q_1...q_m | (σ(q_1,...,q_m) → q) ∈ Δ with q ∈ Q_f}. These constructs are dual to each other, in the following sense:

(A_1) If G is an ε-free CFG then L(NA(G)) = K(D(G)). [12, Prop. 3.4]

(A_2) If 𝒜 = (Q, Sk ∪ Σ, Q_f, Δ) is an NFTA for the skeletal set Sk ∪ Σ then K(D(G(𝒜))) = L(𝒜). That is, the set of structural descriptions of G(𝒜) coincides with the set of trees accepted by 𝒜. [12, Prop. 3.6]
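The two translations NA(G) and G(𝒜) are simple enough to sketch directly. In the Python illustration below, a production X → U_1...U_m is the pair ("X", ("U_1", ..., "U_m")) and a transition rule σ(q_1,...,q_m) → q is the pair (("σ", q_1, ..., q_m), q). These encodings, the function names, and the name "S0" for the fresh start symbol are assumptions of the sketch; the caller must make sure "S0" does not clash with an existing state.

```python
def nfta_of_cfg(productions, start):
    """NA(G): one rule σ(U1,...,Um) → X per production X → U1...Um."""
    rules = {(("σ",) + tuple(rhs), lhs) for lhs, rhs in productions}
    return rules, {start}  # the only final state is the start symbol


def cfg_of_nfta(rules, final_states, fresh_start="S0"):
    """G(A): q → q1...qm per rule, plus S0 → q1...qm when q is final."""
    productions = set()
    for lhs_term, q in rules:
        rhs = lhs_term[1:]  # drop the leading σ
        productions.add((q, rhs))
        if q in final_states:
            productions.add((fresh_start, rhs))
    return productions, fresh_start


# The example grammar with productions S → A, A → aAb, A → ab.
grammar = [("S", ("A",)), ("A", ("a", "A", "b")), ("A", ("a", "b"))]
rules, finals = nfta_of_cfg(grammar, "S")
back, start = cfg_of_nfta(rules, finals)
print(sorted(back))
```

Translating back yields the original three productions together with S0 → A, which adds a fresh start symbol but leaves the set of structural descriptions unchanged.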


We recall the following well-known results: every NFTA is equivalent to a DFTA [10], and every two minimal DFTAs are isomorphic [3].


Cover Tree Automata 


Definition 3 (DCTA). Let ℓ ∈ N⁺ and A be a tree language over the ranked alphabet F. A deterministic cover tree automaton (DCTA) of A with respect to ℓ is a DFTA 𝒜 over a skeletal set Sk ∪ F_0 such that L(𝒜)_[ℓ] = K(A)_[ℓ].

The correspondence between tree automata and ε-free CFGs is carried over to a correspondence between cover tree automata and cover CFGs. More precisely, it can be shown that if G_U is an ε-free CFG, then a DFTA 𝒜 is a DCTA of K(D(G_U)) w.r.t. ℓ if and only if G(𝒜) is a cover CFG of G_U w.r.t. ℓ.


3 Learning Context-free Grammars 


In [12], Sakakibara assumes a learner eager to learn a CFG which is structurally equivalent to the CFG G_U of an unknown context-free language U ⊆ Σ* by asking questions to a teacher. We assume that the learner and the teacher share the skeletal set Sk ∪ Σ for the structural descriptions in K(D(G_U)). The learner can pose the following types of queries:

1. Structural membership queries: the learner asks if some s ∈ 𝒯(Sk ∪ Σ) is in K(D(G_U)). The answer is yes if so, and no otherwise.

2. Structural equivalence queries: the learner proposes a CFG G' and asks whether G' is structurally equivalent to G_U. If the answer is yes, the process stops with the learned answer G'. Otherwise, the teacher provides a counterexample s from the symmetric set difference K(D(G')) △ K(D(G_U)).

This learning protocol is based on what is called a minimally adequate teacher in [1]. Ultimately, the learner constructs a minimal DFTA 𝒜 of K(D(G_U)) from which it can immediately infer the CFG G' = G(𝒜) which is structurally equivalent to G_U, that is, K(D(G')) = K(D(G_U)). In order to understand how 𝒜 gets constructed, we shall introduce a few auxiliary notions.

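For any subset S of 𝒯(Sk ∪ Σ), we define the sets

\[ \sigma_\bullet(S) := \bigcup_{m \in ar(\sigma)} \bigcup_{i=1}^{m} \{ \sigma(s_1,\ldots,s_m)[\bullet]_i \mid s_1,\ldots,s_m \in S \cup \Sigma \}, \]
\[ X(S) := \{ C_1[s] \mid C_1 \in \sigma_\bullet(S),\ s \in S \cup \Sigma \} \setminus S. \]

Note that σ_•(S) = {C ∈ 𝒞(Sk ∪ Σ) \ {•} | C|_p ∈ S ∪ Σ ∪ {•} for all p ∈ Pos(C) ∩ N}.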
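For small inputs, the sets σ_•(S) and X(S) can be enumerated directly. The following Python sketch assumes terms are nested tuples with constants as strings and the hole is the string "•"; the names `sigma_contexts` and `extension` are hypothetical, and the brute-force enumeration is meant to illustrate the definitions rather than to be efficient.

```python
from itertools import product

HOLE = "•"  # hole symbol (encoding chosen for this sketch)


def plug(c, u):
    """Return C[u] for a context c whose hole occurs exactly once."""
    if c == HOLE:
        return u
    if isinstance(c, str):
        return c
    return (c[0],) + tuple(plug(child, u) for child in c[1:])


def sigma_contexts(S, alphabet, ranks, sigma="σ"):
    """σ_•(S): terms σ(s1,...,sm) over S ∪ Σ with one argument made a hole."""
    pool = list(S) + list(alphabet)
    contexts = set()
    for m in ranks:
        for args in product(pool, repeat=m):
            for i in range(m):
                contexts.add((sigma,) + args[:i] + (HOLE,) + args[i + 1:])
    return contexts


def extension(S, alphabet, ranks, sigma="σ"):
    """X(S): one-step extensions C1[s] that are not already in S."""
    pool = list(S) + list(alphabet)
    contexts = sigma_contexts(S, alphabet, ranks, sigma)
    return {plug(c, s) for c in contexts for s in pool} - set(S)


S = {("σ", "a")}
contexts = sigma_contexts(S, ["a", "b"], [1, 2])
X = extension(S, ["a", "b"], [1, 2])
print(len(contexts), len(X))  # 7 11
```

With S = {σ(a)}, Σ = {a, b} and ar(σ) = {1, 2}, there is one unary context σ(•) and six binary ones, and X(S) contains the twelve terms σ(x) and σ(x, y) over S ∪ Σ except σ(a) itself.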


Definition 4. A subset E of 𝒞(Sk ∪ Σ) is •-prefix closed with respect to a set S ⊆ 𝒯(Sk ∪ Σ) if C ∈ E \ {•} implies the existence of C' ∈ E and C_1 ∈ σ_•(S) such that C = C'[C_1]. If E ⊆ 𝒞(Sk ∪ Σ) and S ⊆ 𝒯(Sk ∪ Σ) then E[S] denotes the set of structural descriptions defined by E[S] = {C[s] | C ∈ E, s ∈ S}.

We say that S ⊆ 𝒯(Sk ∪ Σ) is subterm closed if d(s) ≥ 1 for all s ∈ S, and s' ∈ S whenever s' is a subterm of some s ∈ S with d(s') ≥ 1.


An observation table for K(D(G_U)), denoted by (S, E, T), is a tabular representation of the finitary function T : E[S ∪ X(S)] → {0,1} defined by T(t) := 1 if t ∈ K(D(G_U)), and 0 otherwise, where S is a finite nonempty subterm closed subset of 𝒯(Sk ∪ Σ), and E is a finite nonempty subset of 𝒞(Sk ∪ Σ) which is •-prefix closed with respect to S. Such an observation table is visualised as a matrix with rows labeled by elements from S ∪ X(S), columns labeled by elements from E, and the entry for the row of s and the column of C equal to T(C[s]). If we fix a listing (C_1,...,C_k) of all elements of E, then the row of values of some s ∈ S ∪ X(S) corresponds to the vector row(s) = (T(C_1[s]),...,T(C_k[s])). In fact, for every such s, row(s) is a finitary representation of the function f_s : E → {0,1} defined by f_s(C) = T(C[s]).

The observation table (S, E, T) is closed if every row(x) with x ∈ X(S) is identical to some row(s) of s ∈ S. It is consistent if whenever s_1, s_2 ∈ S are such that row(s_1) = row(s_2), we have row(C_1[s_1]) = row(C_1[s_2]) for all C_1 ∈ σ_•(S).

The DFTA corresponding to a closed and consistent observation table (S, E, T) is 𝒜(S, E, T) = (Q, Sk ∪ Σ, Q_f, δ) where

Q := {row(s) | s ∈ S}, Q_f := {row(s) | s ∈ S and T(s) = 1},

and δ is uniquely defined by

δ_m(σ)(q_1,...,q_m) := row(σ(r_1,...,r_m)) for all m ∈ ar(σ),

where r_i := a if q_i = a ∈ Σ, and r_i := s_i if q_i = row(s_i) ∈ Q.

It is easy to check that, under these assumptions, 𝒜(S, E, T) is well-defined, and that δ*(s) = row(s). Furthermore, Sakakibara proved that the following properties hold whenever (S, E, T) is a closed and consistent observation table:

1. 𝒜(S, E, T) is consistent with T, that is, for all s ∈ S ∪ X(S) and C ∈ E we have δ*(C[s]) ∈ Q_f iff T(C[s]) = 1. [12, Lemma 4.2]

2. If 𝒜(S, E, T) = (Q, Sk ∪ Σ, Q_f, δ) has n states, and 𝒜' = (Q', Sk ∪ Σ, Q'_f, δ') is any DFTA consistent with T that has n or fewer states, then 𝒜' is isomorphic to 𝒜(S, E, T). [12, Lemma 4.3]


The LA Algorithm 


In this subsection we briefly recall Sakakibara’s algorithm LA whose pseudocode is shown in Figure 1. LA extends the observation table whenever one of the following situations occurs: the table is not consistent, the table is not closed, or the table is both consistent and closed but the CFG corresponding to the resulting automaton 𝒜(S, E, T) is not structurally equivalent to G_U (in which case a counterexample is produced). The first two situations trigger an extension of the observation table with one distinct row. From properties (A_1) and (A_2), if n is the number of states of the minimal DFTA for the structural descriptions of G_U, then the number of unsuccessful consistency and closedness checks during the whole run of this algorithm is at most n − 1. For each counterexample of size at most m returned by a structural equivalence query, at most m subtrees are added to S. Since the algorithm encounters at most n counterexamples, the total number of elements in S cannot exceed n + m·n, thus LA must terminate. It also follows that the number of elements of the domain E[S ∪ X(S)] of the function T is at most (n + m·n + (l + m·n + p)^d)·n = O(m^d · n^(d+1)), where l is the number of distinct ranks of σ ∈ Sk, p is the cardinality of F_0, and d is the maximum rank of a symbol in Sk. A careful analysis of LA reveals that its time complexity is indeed bounded by a polynomial in m and n [12, Thm. 5.3].

    Set S := ∅ and E := {•}
    let G' := ({S}, Σ, ∅, S)
    check if G' is structurally equivalent with G_U
    if answer is yes then halt and output G'
    if answer is no with counterexample t then
        add t and all its subterms with depth at least 1 to S
        construct the observation table (S, E, T) using structural membership queries
        repeat
            while (S, E, T) is not closed or not consistent
                if (S, E, T) is not consistent then
                    find s_1, s_2 ∈ S, C ∈ E, and C_1 ∈ σ_•(S) such that
                        row(s_1) = row(s_2) and T(C[C_1[s_1]]) ≠ T(C[C_1[s_2]])
                    add C[C_1] to E
                    extend T to E[S ∪ X(S)] using structural membership queries
                if (S, E, T) is not closed then
                    find s_1 ∈ X(S) such that row(s_1) ≠ row(s) for all s ∈ S
                    add s_1 to S
                    extend T to E[S ∪ X(S)] using structural membership queries
            /* (S, E, T) is now closed and consistent */
            let G' := G(𝒜(S, E, T))
            make the structural equivalence query between G' and G_U
            if the reply is no with a counterexample t then
                add t and all its subterms with depth at least 1 to S
                extend T to E[S ∪ X(S)] using structural membership queries
        until the reply is yes to the structural equivalence query between G' and G_U
        halt and output G'.

Figure 1: Sakakibara’s algorithm


4 Learning Cover Context-free Grammars 


We assume we are given a teacher who knows an ε-free CFG G_U for a language U ⊆ Σ*, and a learner who knows the skeletal set Sk ∪ Σ for K(D(G_U)). The teacher and learner both know a positive integer ℓ, and the learner is interested in learning a cover CFG G' of G_U w.r.t. ℓ or, equivalently, a DCTA of K(D(G_U)) w.r.t. ℓ. The learner is allowed to pose the following types of questions:

1. Structural membership queries: the learner asks if some s ∈ 𝒯(Sk ∪ Σ)_[ℓ] is in K(D(G_U)). The answer is yes if so, and no otherwise.

2. Structural equivalence queries: the learner proposes a CFG G', and asks if G' is a cover CFG of G_U w.r.t. ℓ. If the answer is yes, the process stops with the learned answer G'. Otherwise, the teacher provides a counterexample from the set K(D(G_U))_[ℓ] △ K(D(G'))_[ℓ].

We will describe an algorithm LA^ℓ that learns a cover CFG of G_U with respect to ℓ in time that is polynomial in the number of states of a minimal DCTA of the regular tree language K(D(G_U)).


4.1 The Observation Table 


LA^ℓ is a generalisation of the learning algorithm L^ℓ proposed by Ipate [8]. Ipate’s algorithm is designed to learn a minimal finite cover automaton of an unknown finite language of words in polynomial time, using membership queries and language equivalence queries that refer to words and languages of words with length at most ℓ. Similarly, LA^ℓ is designed to learn a minimal DCTA 𝒜' for K(D(G_U)) with respect to ℓ by maintaining an observation table (S, E, T, ℓ) for K(D(G_U)) which differs from the observation table of LA in the following respects:

1. S is a finite nonempty subterm closed subset of 𝒯(Sk ∪ Σ)_[ℓ].

2. E is a finite nonempty subset of 𝒞(Sk ∪ Σ)_(ℓ−1) ∩ 𝒞(Sk ∪ Σ)_[ℓ] which is •-prefix closed with respect to S.

3. T : E[S ∪ X(S)_[ℓ]] → {1, 0, −1} is defined by

\[ T(t) := \begin{cases} 1 & \text{if } t \in K(D(G_U))_{[\ell]},\\ 0 & \text{if } t \in \mathcal{T}(Sk \cup \Sigma)_{[\ell]} \setminus K(D(G_U)),\\ -1 & \text{if } t \notin \mathcal{T}(Sk \cup \Sigma)_{[\ell]}. \end{cases} \]
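The three-valued function T can be pictured as a thin wrapper around a structural membership oracle. The Python sketch below assumes terms are nested tuples with constants as strings; the oracle `member`, the helper `make_T`, and the small finite language used in the demonstration are hypothetical and serve only to illustrate the case distinction.

```python
def depth(t):
    """Depth of a term; constants have depth 0."""
    if isinstance(t, str):
        return 0
    return 1 + max(depth(child) for child in t[1:])


def make_T(member, ell):
    """Wrap a membership oracle into the three-valued table function T."""
    def T(t):
        if depth(t) > ell:
            return -1  # deeper than ell: no membership query is needed
        return 1 if member(t) else 0
    return T


# Hypothetical oracle: a finite set of structural descriptions.
language = {("σ", "a"), ("σ", "b"), ("σ", ("σ", "a"), "b")}
T = make_T(lambda t: t in language, ell=2)

print(T(("σ", "a")))                    # 1
print(T(("σ", ("σ", "b"), "b")))        # 0
print(T(("σ", ("σ", ("σ", "a")), "b"))) # -1
```

The value −1 is assigned purely on the basis of depth, so entries for terms deeper than ℓ never cost a query to the teacher.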


In a tabular representation, the observation table (S, E, T, ℓ) is a two-dimensional matrix with rows labeled by elements from S ∪ X(S)_[ℓ], columns labeled by elements from E, and the entry corresponding to the row of t and column of C equal to T(C[t]). If we fix a listing (C_1,...,C_k) of all elements from E, then the row of t in the observation table is described by the vector (T(C_1[t]),...,T(C_k[t])) of values from {−1, 0, 1}. The rows of an observation table are used to identify the states of a minimal DCTA for K(D(G_U)) with respect to ℓ. But, like Ipate [8], we do not compare rows by equality but by a similarity relation.


4.2 The Similarity Relation 


This time, the rows in the observation table correspond to terms from S ∪ X(S)_[ℓ], and the comparison of rows should take into account only terms of depth at most ℓ. For this purpose, we define a relation ~_k of k-similarity, which is a generalisation to terms of Ipate’s relation of k-similarity on strings [8].

Definition 5 (k-similarity). For 1 ≤ k ≤ ℓ we define the relation ~_k on the elements of the set S ∪ X(S) of an observation table (S, E, T, ℓ) as follows:

s ~_k t if, for every C ∈ E_(k−max{d(s),d(t)}), T(C[s]) = T(C[t]).

When the relation ~_k does not hold between two terms s, t ∈ S ∪ X(S), we write s ≁_k t and say that s and t are k-dissimilar. When k = ℓ we simply say that s and t are similar or dissimilar and write s ~ t or s ≁ t, respectively.

We say that a context C ℓ-distinguishes s_1 and s_2, where s_1, s_2 ∈ S, if C ∈ E_(ℓ−max{d(s_1),d(s_2)}) and T(C[s_1]) ≠ T(C[s_2]).

Note that only the contexts C ∈ E_(k−max{d(s),d(t)}) with d(C) ≤ ℓ are relevant to check whether s ~_k t, because if d(C) > ℓ then d(C[s]) > ℓ and d(C[t]) > ℓ, and therefore T(C[s]) = −1 = T(C[t]). Also, if t ∈ S ∪ X(S) with d(t) > ℓ then it must be the case that t ∈ X(S), and then t ~_k s for all s ∈ S ∪ X(S) and 1 ≤ k ≤ ℓ, because E_(k−max{d(t),d(s)}) = ∅.

The relation of k-similarity is obviously reflexive and symmetric, but not transitive. The following example illustrates this fact.

Example 1. Let Σ = {a, b}, k = 1, ℓ = 2, S = {σ(a), σ(b), σ(σ(a), b)}, E = {•, σ(•, b)}, t_1 = σ(a), t_2 = σ(σ(a), b), t_3 = σ(b), and

G_U = ({S, A}, {a, b}, {S → a, S → b, S → Ab, A → a, A → Ab}, S).

S is a nonempty subterm closed subset of 𝒯(Sk ∪ Σ)_[ℓ], and E is a nonempty subset of 𝒞(Sk ∪ Σ)_(ℓ−1) which is •-prefix closed with respect to S. We have K(D(G_U))_[ℓ] = {t_1, t_2, t_3}, t_1 ~_ℓ t_2 because E_(ℓ−max{d(t_1),d(t_2)}) = {•} and T(•[t_1]) = 1 = T(•[t_2]), and t_2 ~_ℓ t_3 because E_(ℓ−max{d(t_2),d(t_3)}) = {•} and T(•[t_2]) = 1 = T(•[t_3]). However, t_1 ≁_ℓ t_3 because C = σ(•, b) ∈ E_(1) = E_(ℓ−max{d(t_1),d(t_3)}) and T(C[t_1]) = T(σ(σ(a), b)) = T(t_2) = 1, but T(C[t_3]) = T(σ(σ(b), b)) = 0. □
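The failure of transitivity in Example 1 can be checked mechanically. The Python sketch below assumes terms are nested tuples with constants as strings and the hole is the string "•"; the helper names are hypothetical, and the finite set `language` simply lists the three structural descriptions of depth at most 2 given in the example. Similarity is evaluated for k = ℓ = 2.

```python
HOLE = "•"  # hole symbol (encoding chosen for this sketch)


def depth(t):
    if isinstance(t, str):
        return 0
    return 1 + max(depth(child) for child in t[1:])


def hole_depth(c):
    """Length of the hole position, or None if c contains no hole."""
    if c == HOLE:
        return 0
    if isinstance(c, str):
        return None
    for child in c[1:]:
        d = hole_depth(child)
        if d is not None:
            return d + 1
    return None


def plug(c, u):
    if c == HOLE:
        return u
    if isinstance(c, str):
        return c
    return (c[0],) + tuple(plug(child, u) for child in c[1:])


def similar(s, t, k, E, T):
    """s ~_k t: agree on all contexts of hole depth ≤ k − max{d(s), d(t)}."""
    bound = k - max(depth(s), depth(t))
    return all(T(plug(c, s)) == T(plug(c, t))
               for c in E if hole_depth(c) <= bound)


ELL = 2
t1, t2, t3 = ("σ", "a"), ("σ", ("σ", "a"), "b"), ("σ", "b")
language = {t1, t2, t3}          # K(D(G_U)) restricted to depth ≤ 2
E = [HOLE, ("σ", HOLE, "b")]


def T(t):
    if depth(t) > ELL:
        return -1
    return 1 if t in language else 0


print(similar(t1, t2, ELL, E, T))  # True
print(similar(t2, t3, ELL, E, T))  # True
print(similar(t1, t3, ELL, E, T))  # False
```

The context σ(•, b) is only consulted for the pair (t_1, t_3), since it is the only pair whose maximal depth leaves room for a hole at depth 1.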


Still, k-similarity has a useful property, captured in the following lemma. 


Lemma 1. Let (S, E, T, ℓ) be an observation table. If s, t, x ∈ S ∪ X(S) are such that d(x) ≤ max{d(s), d(t)}, then s ~_k t whenever s ~_k x and x ~_k t.

Proof. Suppose s ~_k x and x ~_k t. By definition of ~_k, we have

T(C[s]) = T(C[x]) for all C ∈ E_(k−max{d(s),d(x)}), and
T(C[x]) = T(C[t]) for all C ∈ E_(k−max{d(x),d(t)}).

Let m := max{d(s), d(t)}. Since d(x) ≤ m, it follows that for every C ∈ E_(k−m) we also have C ∈ E_(k−max{d(s),d(x)}) and C ∈ E_(k−max{d(x),d(t)}). Thus T(C[s]) = T(C[x]) = T(C[t]) for all C ∈ E_(k−m). Hence s ~_k t. □


In addition, we will assume a given total order ≺ on the alphabet Σ, and the following total orders induced by ≺ on 𝒯(Sk ∪ Σ) and 𝒞(Sk ∪ Σ).

Definition 6. The total order ≺_T on 𝒯(Sk ∪ Σ) induced by a total order ≺ on Σ is defined as follows: s ≺_T t if either (a) d(s) < d(t), or (b) d(s) = d(t) and

1. s, t ∈ Σ and s ≺ t, or

2. s = σ(s_1,...,s_m), t = σ(t_1,...,t_n), and there exists 1 ≤ k ≤ min(m, n) such that s_k ≺_T t_k and s_i = t_i for all 1 ≤ i < k, or

3. s = σ(s_1,...,s_m), t = σ(t_1,...,t_n), m < n, and s_i = t_i for all 1 ≤ i ≤ m.

The total order ≺_C on 𝒞(Sk ∪ Σ) induced by a total order ≺ on Σ is defined as follows: C_1 ≺_C C_2 if either (a) d_•(C_1) < d_•(C_2) or (b) d_•(C_1) = d_•(C_2) and C_1 ≺_T C_2, where C_1, C_2 are interpreted as terms over the signature with Σ extended with the constant • such that • ≺ a for all a ∈ Σ.
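One way to realise an order of this kind on terms in code is through a sort key. The Python sketch below assumes terms are nested tuples with constants as strings, and compares first by depth and then argument by argument from left to right; the treatment of a term whose arguments form a proper prefix of another's (the shorter one is smaller) is an assumption of the sketch. The names `order_key` and `rank` are hypothetical.

```python
def depth(t):
    if isinstance(t, str):
        return 0
    return 1 + max(depth(child) for child in t[1:])


def order_key(t, rank):
    """Sort key on terms: depth first, then arguments left to right.

    `rank` maps each constant of Σ to its index in the given total order.
    """
    if isinstance(t, str):
        return (0, rank[t])
    return (depth(t), tuple(order_key(child, rank) for child in t[1:]))


rank = {"a": 0, "b": 1}  # encodes a ≺ b
terms = [
    ("σ", "b"),
    ("σ", ("σ", "a"), "b"),
    ("σ", "a", "b"),
    ("σ", "a"),
]
ordered = sorted(terms, key=lambda t: order_key(t, rank))
print(ordered)
# [('σ', 'a'), ('σ', 'a', 'b'), ('σ', 'b'), ('σ', ('σ', 'a'), 'b')]
```

Python compares tuples lexicographically and treats a proper prefix as smaller, so keys of equal depth are ordered by their first differing argument, and the depth component guarantees that a constant's key is never compared against a tuple of child keys.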


Definition 7 (representative). Let (S, E, T, ℓ) be an observation table and x ∈ S ∪ X(S). We say x has a representative in S if {s ∈ S | s ~ x} ≠ ∅. If so, the representative of x is r(x) := min_{≺_T} {s ∈ S | x ~ s}.

We will show later that the construction of an observation table (S, E, T, ℓ) is instrumental to the construction of a cover tree automaton, and the states of the automaton correspond to representatives of the elements from S ∪ X(S). Note that, if (S, E, T, ℓ) is an observation table and x ∈ S ∪ X(S) has d(x) > ℓ then x ∈ X(S) and x ~ s for all s ∈ S. Then s ≺_T x because d(s) ≤ ℓ < d(x) for all s ∈ S. Thus x has a representative in S, and r(x) = min_{≺_T} S. For this reason, only the rows for elements x ∈ S ∪ X(S)_[ℓ] are kept in an observation table.


4.3 Consistency and Closedness


The consistency and closedness of an observation table are defined as follows. 


Definition 8 (Consistency). An observation table (S, E, T, ℓ) is consistent if, for every k ∈ {1,...,ℓ}, s_1, s_2 ∈ S, and C_1 ∈ σ_•(S), the following implication holds: if s_1 ~_k s_2 then C_1[s_1] ~_k C_1[s_2].


The following lemma captures a useful property of consistent observation 
tables. 


Lemma 2. Let (S, E, T, ℓ) be a consistent observation table. Let m ∈ ar(σ), 1 ≤ k ≤ ℓ, and s_1,...,s_m, t_1,...,t_m ∈ S ∪ Σ be such that, for all 1 ≤ i ≤ m, either s_i = t_i ∈ Σ, or s_i, t_i ∈ S, s_i ~_k t_i, and d(s_i) ≤ d(t_i). Let s = σ(s_1,...,s_m) and t = σ(t_1,...,t_m). Then s ~_k t.

Proof. Let I = {i_1,...,i_p} = {i ∈ {1,...,m} | s_i, t_i ∈ S}. If I = ∅ then s = t and the result follows from the reflexivity of ~_k. If I ≠ ∅, let x_0 := s, and x_j := x_{j−1}[t_{i_j}]_{i_j} for 1 ≤ j ≤ p. For all 1 ≤ j ≤ p we have

\[ \left. \begin{array}{l} s_{i_j}, t_{i_j} \in S \\ s_{i_j} \sim_k t_{i_j} \\ x_{j-1}[\bullet]_{i_j} \in \sigma_\bullet(S) \end{array} \right\} \Rightarrow x_{j-1} = x_{j-1}[s_{i_j}]_{i_j} \sim_k x_{j-1}[t_{i_j}]_{i_j} = x_j \]

because the observation table (S, E, T, ℓ) is consistent. Thus x_0 ~_k x_1, ..., x_{p−1} ~_k x_p, and d(x_0) ≤ d(x_1) ≤ ... ≤ d(x_{p−1}) ≤ d(x_p). Repeated applications of Lemma 1 yield x_0 ~_k x_p. But x_0 = s and x_p = t, thus s ~_k t. □


Definition 9 (Closedness). An observation table (S, E, T, ℓ) is closed if, for all x ∈ X(S), there exists s ∈ S with d(s) ≤ d(x) such that x ~ s.

The next five lemmata capture important properties of closed observation tables:


Lemma 3. If (S, E, T, ℓ) is closed then every x ∈ S ∪ X(S) has a representative, and d(r(x)) ≤ d(x).

Proof. If x ∈ S then x has a representative since {s ∈ S | x ~ s} ≠ ∅ (it contains x itself, by reflexivity of ~). Then r(x) ≼_T x, which implies d(r(x)) ≤ d(x). If x ∈ X(S) then, since the observation table is closed, there exists s ∈ S with x ~ s and d(s) ≤ d(x). Now x ~ s and s ∈ S imply r(x) ≼_T s, hence d(r(x)) ≤ d(s). Thus d(r(x)) ≤ d(x) because d(s) ≤ d(x). □


Lemma 4. If (S, E, T, ℓ) is closed, r_1, r_2 ∈ {r(x) | x ∈ S ∪ X(S)}, and r_1 ~ r_2 then r_1 = r_2.

Proof. Suppose r_1 = r(x_1) and r_2 = r(x_2) for some x_1, x_2 ∈ S ∪ X(S). By Lemma 3, r_1, r_2 ∈ S and d(r_1) ≤ d(x_1) ≤ max{d(x_1), d(r_2)}. Since x_1 ~ r_1 and r_1 ~ r_2, Lemma 1 implies x_1 ~ r_2, thus r_2 ∈ {s ∈ S | x_1 ~ s} and r_1 = min_{≺_T} {s ∈ S | x_1 ~ s} ≼_T r_2. By a similar argument, we learn that r_2 ≼_T r_1. From r_1 ≼_T r_2 and r_2 ≼_T r_1, we conclude that r_1 = r_2. □


Lemma 5. If (S, E, T, ℓ) is closed and r ∈ {r(x) | x ∈ S ∪ X(S)}, then r(r) = r.

Proof. Let r_1 = r(r). Then r_1 ~ r and r_1, r ∈ {r(x) | x ∈ S ∪ X(S)}. By Lemma 4, r = r_1. □


Lemma 6. If (S, E, T, ℓ) is closed, then for every x ∈ S ∪ X(S) and C_1 ∈ σ_•(S), there exists s ∈ S such that r(C_1[r(x)]) = r(s).

Proof. Let x ∈ S ∪ X(S) and C_1 ∈ σ_•(S). The fact that (S, E, T, ℓ) is closed implies r(x) ∈ S, thus C_1[r(x)] ∈ S ∪ X(S) and therefore r(C_1[r(x)]) ∈ S. We can choose s := r(C_1[r(x)]) ∈ S, for which r(s) = s by Lemma 5. □


Lemma 7. Let (S, E, T, ℓ) be closed, r ∈ {r(x) | x ∈ S ∪ X(S)}, C_1 ∈ σ_•(S), and s ∈ S. If C_1[s] ~ r then d(r) ≤ d(C_1[s]).

Proof. We provide a proof by contradiction. Assume d(r) > d(C_1[s]). Since C_1[s] ∈ S ∪ X(S) and (S, E, T, ℓ) is closed, r(C_1[s]) ∈ S, d(r(C_1[s])) ≤ d(C_1[s]) (by Lemma 3), r(C_1[s]) ~ C_1[s], and C_1[s] ~ r. Thus r(C_1[s]) ~ r by Lemma 1. Since r, r(C_1[s]) ∈ {r(x) | x ∈ S ∪ X(S)}, we have r = r(C_1[s]) by Lemma 4. Thus d(r) = d(r(C_1[s])) ≤ d(C_1[s]), which yields a contradiction. □


The Automaton A(T) 


Like L‘, our algorithm relies on the construction of a consistent and closed 
observation table of the unknown context-free grammar. The table is used 
to build an automaton which, in the end, turns out to be a minimal DCTA 
for the structural descriptions of the unknown grammar. 


Definition 10. Suppose $\mathbb{T} = (S,E,T,\ell)$ is a closed and consistent observation
table. The automaton corresponding to this table, denoted by $\mathcal{A}(\mathbb{T})$,
is the DFTA $(Q, Sk \cup \Sigma, Q_f, \delta)$ where $Q := \{r(s) \mid s \in S\}$, $Q_f := \{q \in Q \mid
T(q) = 1\}$, and $\delta$ is uniquely defined by $\delta_m(\sigma)(q_1,\ldots,q_m) := r(\sigma(q_1,\ldots,q_m))$
for all $m \in ar(\sigma)$.

The transition function $\delta$ is well defined because, for all $m \in ar(\sigma)$
and $q_1,\ldots,q_m$ from $Q$, $C_1 := \sigma(\bullet, q_2,\ldots,q_m) \in \sigma_\bullet(S)$, thus $\sigma(q_1,\ldots,q_m) =
C_1[q_1] \in S \cup X(S)$ and $r(C_1[q_1]) = r(s)$ for some $s \in S$, by Lemma 6.
Hence, $r(\sigma(q_1,\ldots,q_m)) \in Q$. Also, the set $Q_f$ can be read off directly
from the observation table because $\bullet \in E$ (since $E$ is $\bullet$-prefix closed), thus
$q = \bullet[q] \in E[S \cup X(S)_{[\ell]}]$ for all $q \in Q$, and we can read off from the
observation table all $q \in Q$ with $T(q) = 1$.
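
The bottom-up evaluation behind Definition 10 can be illustrated with a small Python sketch. The term encoding (strings for symbols of $\Sigma$, tuples `("sigma", child, ...)` for inner nodes), the dictionary `rep` standing in for the representative function $r$, and all function names are assumptions made for this illustration only; they do not come from the paper.

```python
from typing import Dict, Set, Tuple, Union

# Assumed encoding: a leaf from Sigma is a string; an inner node is a tuple
# ("sigma", child_1, ..., child_m).
Term = Union[str, Tuple]


def run(term: Term, rep: Dict[Term, Term]) -> Term:
    """Evaluate delta* bottom-up.

    Leaves evaluate to themselves; an inner node sigma(t_1..t_m) evaluates to
    rep[sigma(q_1..q_m)], where q_i is the state reached on t_i.
    """
    if isinstance(term, str):
        return term
    states = tuple(run(child, rep) for child in term[1:])
    return rep[(term[0],) + states]  # KeyError if the toy table lacks the entry


def accepts(term: Term, rep: Dict[Term, Term], final: Set[Term]) -> bool:
    """A term is accepted when delta* sends it to a final state."""
    return run(term, rep) in final


# Hypothetical toy table with two representatives q1 and q2.
q1 = ("sigma", "a")
q2 = ("sigma", "a", "a")
rep = {q1: q1, q2: q2, ("sigma", q1, "a"): q2, ("sigma", q2): q1}
final = {q2}

print(run(("sigma", ("sigma", "a"), "a"), rep))
print(accepts(("sigma", ("sigma", "a"), "a"), rep, final))
```

In this toy table every representative evaluates to itself, which mirrors the fixed-point behaviour one expects of states of $\mathcal{A}(\mathbb{T})$.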


In the rest of this subsection we assume that $\mathbb{T} = (S,E,T,\ell)$ is closed
and consistent, and $\delta$ is the transition function of the corresponding DFTA
$\mathcal{A}(\mathbb{T})$.

Lemma 8. $\delta^*(x) \sim x$ and $d(\delta^*(x)) \le d(x)$ for every $x \in S \cup X(S)$.


Proof. By induction on the depth of $x$. If $d(x) = 1$ then $\delta^*(x) = r(x) \sim x$
and $d(\delta^*(x)) = d(r(x)) \le d(x)$ by Lemma 3.

If $d(x) > 1$ then $x = \sigma(s_1,\ldots,s_m)$ with $s_1,\ldots,s_m \in S \cup \Sigma$, and $\delta^*(x) =
r(\sigma(q_1,\ldots,q_m))$, where $q_i = \delta^*(s_i)$ for $1 \le i \le m$. Let $I := \{i \mid s_i \notin \Sigma\}$.
Then, by induction hypothesis, $q_i \sim s_i$ and $d(q_i) \le d(s_i)$ for all $i \in I$. Thus

$$d(\sigma(q_1,\ldots,q_m)) \le d(\sigma(s_1,\ldots,s_m)) = d(x),$$

$$\delta^*(x) = r(\sigma(q_1,\ldots,q_m)) \;\Rightarrow\; d(\delta^*(x)) \le d(\sigma(q_1,\ldots,q_m)), \text{ by Lemma 2.}$$

Hence $d(\delta^*(x)) \le d(x)$ follows from $d(\delta^*(x)) \le d(\sigma(q_1,\ldots,q_m)) \le d(x)$.
To prove $\delta^*(x) \sim x$, we notice that $x = \sigma(s_1,\ldots,s_m) \sim \sigma(q_1,\ldots,q_m)$
follows from Lemma 2. Thus the three facts

$$\delta^*(x) = r(\sigma(q_1,\ldots,q_m)) \sim \sigma(q_1,\ldots,q_m),$$
$$\sigma(q_1,\ldots,q_m) \sim x,$$
$$d(\sigma(q_1,\ldots,q_m)) \le d(x) \le \max\{d(x), d(\delta^*(x))\}$$

imply $\delta^*(x) \sim x$ by Lemma 1. □

Corollary 1. $\delta^*(x) = x$ for all $x \in \{r(s) \mid s \in S \cup X(S)\}$.


Proof. By Lemma 8, $x \sim \delta^*(x)$. Since both $\delta^*(x)$ and $x$ belong to the set
of representatives $\{r(s) \mid s \in S \cup X(S)\}$, $x = \delta^*(x)$ by Lemma 4. □


The following theorem shows that the DFTA of a closed and consistent
observation table is consistent with the function $T$ on terms of depth at
most $\ell$.


Theorem 1. Let $\mathbb{T} = (S,E,T,\ell)$ be a closed and consistent observation
table. For every $s \in S \cup X(S)$ and $C \in E$ such that $d(C[s]) \le \ell$ we have
$\delta^*(C[s]) \in Q_f$ if and only if $T(C[s]) = 1$.


Proof. Let $s \in S \cup X(S)$ and $C \in E$ such that $d(C[s]) \le \ell$. We proceed by
induction on the hole depth of $C$. If $d_\bullet(C) = 0$ then $C = \bullet$ and $C[s] = s$ has
$d(s) \le \ell$. By Lemma 8, $\delta^*(s) \sim s$ and $d(\delta^*(s)) \le d(s)$. Thus, since $\bullet \in E$
and $d(s) \le \ell$, $T(\delta^*(s)) = 1$ if and only if $T(s) = 1$. By definition of $\mathcal{A}(\mathbb{T})$,
$\delta^*(s) \in Q_f$ if and only if $T(\delta^*(s)) = 1$. Hence $\delta^*(s) \in Q_f$ if and only if
$T(s) = 1$.

If $d_\bullet(C) = m > 0$ then $d(C[s]) \le \ell$ implies $m \le \ell - d(s)$ and $C \in E_{\langle m \rangle}$.
Since $E$ is $\bullet$-prefix closed, there exist $C' \in E_{\langle m-1 \rangle}$ and $C_1 \in \sigma_\bullet(S)$ such that
$C = C'[C_1] \in E_{\langle m \rangle}$. Let $t = \delta^*(C_1[s])$. Then $d(t) \le d(C_1[s])$ by Lemma 8,
thus $d(C'[t]) \le d(C'[C_1[s]]) = d(C[s]) \le \ell$, and we learn from the induction
hypothesis for $C'$ that $\delta^*(C'[t]) \in Q_f$ if and only if $T(C'[t]) = 1$. Since
$\delta^*(t) = t$ by Corollary 1 and $t = \delta^*(C_1[s])$ by definition, we obtain

$$\delta^*(C'[t]) = \delta^*(C'[C_1[s]]) = \delta^*(C[s]),$$

so $\delta^*(C[s]) \in Q_f \Leftrightarrow \delta^*(C'[t]) \in Q_f \Leftrightarrow T(C'[t]) = 1$. Therefore, it
suffices to show that $T(C'[t]) = 1$ if and only if $T(C[s]) = 1$. By Lemma
8, $t \sim C_1[s]$ and $d(t) \le d(C_1[s])$, thus $d(C'[t]) \le d(C'[C_1[s]]) = d(C[s]) \le \ell$.
Hence, since $C' \in E$ and $t \sim C_1[s]$, $T(C'[t]) = 1$ if and only if $T(C'[C_1[s]]) =
1$. □

Theorem 2. Let $\mathbb{T} = (S,E,T,\ell)$ be a closed and consistent observation
table, and $N$ be the number of states of $\mathcal{A}(\mathbb{T})$. If $\mathcal{A}'$ is any other DFTA with
$N$ or fewer states that is consistent with $T$ on terms of depth at most $\ell$,
then $\mathcal{A}'$ has exactly $N$ states and $\mathcal{L}(\mathcal{A}(\mathbb{T}))_{[\ell]} = \mathcal{L}(\mathcal{A}')_{[\ell]}$.


Proof. Let $\mathcal{A}' = (Q', Sk \cup \Sigma, Q'_f, \delta')$ and $f : Q \to Q'$ be defined by $f(q) := \delta'^*(q)$
for all $q \in Q$. We show that $f$ is injective. If $q_1, q_2 \in Q$ are such that $q_1 \ne q_2$
then $T(C[q_1]) \ne T(C[q_2])$ for some $C \in E_{\langle \ell - \max\{d(q_1), d(q_2)\} \rangle}$. Since $\mathcal{A}'$ is
consistent with $T$, exactly one of $\delta'^*(C[q_1])$ and $\delta'^*(C[q_2])$ is in $Q'_f$. Hence
$f(q_1) \ne f(q_2)$. Since $f$ is injective and $Q'$ has at most the same number of
states as $Q$, $f$ is bijective. Thus $Q' = f(Q) = \{\delta'^*(q) \mid q \in Q\}$.

Next, we show that $Q'_f = \{f(q) \mid q \in Q_f\}$. By Theorem 1, $\delta^*(q) \in Q_f$
if and only if $T(q) = 1$. By Corollary 1, $\delta^*(q) = q$ for all $q \in Q$. Thus
$Q_f = \{q \in Q \mid T(q) = 1\}$. Similarly, since $\mathcal{A}'$ is consistent with $T$, for every
$q \in Q$, $T(q) = 1$ if and only if $\delta'^*(q) \in Q'_f$. By definition, $\delta'^*(q) = f(q)$. Thus,
$Q'_f = \{\delta'^*(q) \mid q \in Q \text{ and } T(q) = 1\} = \{f(q) \mid q \in Q_f\}$.

We prove by induction on the depth of $x \in \mathcal{T}(Sk \cup \Sigma)_{[\ell]} \setminus \Sigma$ that, if
$\delta^*(x) = q$ and $\delta'^*(x) = f(q')$ then the following statements hold:


1. $d(q) \le d(x)$,
2. $d(q') \le d(x)$,
3. if $m = \ell - d(x) + \max\{d(q), d(q')\}$, then $q \sim_m q'$.


In the base case, $d(x) = 1$ and $q = r(x)$. By Lemma 3, $q \sim x$ and $d(q) \le d(x)$,
thus $d(q)$ can only be 1. $\delta'^*(x) = f(q')$ implies $\delta'^*(C[x]) = \delta'^*(C[q'])$ for all
$C \in E_{\langle \ell - \max\{d(q'), d(x)\} \rangle}$. Since $\mathcal{A}'$ is consistent with $T$ on $\mathcal{T}(Sk \cup \Sigma)_{[\ell]}$, this
implies $T(C[x]) = 1$ if and only if $T(C[q']) = 1$. Therefore $x \sim q'$. From
$r(x) \sim x$, $x \sim q'$, and $d(x) = 1 \le \max\{d(r(x)), d(q')\}$, we learn by Lemma 1 that
$q' \sim r(x) = q$. Then $q = q'$ by Lemma 4, because $q, q' \in Q = \{r(s) \mid s \in S\}$
and $q \sim q'$. Thus $d(q') = d(q) \le d(x)$. In this case, $m = \ell$ and statement 3
obviously holds because $\sim_\ell$ is reflexive.

In the induction step, we assume that all three statements hold for
all terms $x \in \mathcal{T}(Sk \cup \Sigma)_{[k]} \setminus \Sigma$ with $k \ge 1$. Let $x \in \mathcal{T}(Sk \cup \Sigma)_{[\ell]}$ with
$d(x) = k+1$. Then $x = \sigma(x_1,\ldots,x_p)$ with $d(x_i) \le k$ for $1 \le i \le p$. Let
$I := \{i \mid 1 \le i \le p \text{ and } x_i \notin \Sigma\}$, and $q, q', q_i, q'_i \in Q$ such that $\delta^*(x) = q$,
$\delta'^*(x) = f(q')$, and $q_i = \delta^*(x_i)$, $\delta'^*(x_i) = f(q'_i)$ for all $i \in I$.

Let $y := \sigma(s_1,\ldots,s_p)$ where $s_i := x_i$ if $x_i \in \Sigma$ and $s_i := q_i$ otherwise.
Then $y \in S \cup X(S)$ and $q = r(y)$, thus $q \sim y$ and $d(q) \le d(y)$ by Lemma 3.
Also $d(y) \le d(x)$ because $y = \sigma(s_1,\ldots,s_p)$, $x = \sigma(x_1,\ldots,x_p)$, and

- $d(s_i) = d(q_i) \le d(x_i)$ for all $i \in I$, by induction hypothesis,
- $s_i = x_i$ for all $i \in \{1,\ldots,p\} \setminus I$, hence $d(s_i) = d(x_i)$ for all $i \in \{1,\ldots,p\} \setminus I$.

Thus $d(q) \le d(x)$ follows from $d(q) \le d(y)$ and $d(y) \le d(x)$.

To show $d(q') \le d(x)$, we reason as follows. $\delta'^*(q') = f(q') = \delta'^*(x) =
\delta'_p(\sigma)(\delta'^*(x_1),\ldots,\delta'^*(x_p))$. Since $\delta'^*(x_i) = \delta'^*(q'_i)$ for all $i \in I$, and $\delta'^*(x_i) =
x_i$ for all $i \in \{1,\ldots,p\} \setminus I$, we learn that $\delta'^*(q') = \delta'^*(z)$ where $z =
\sigma(t_1,\ldots,t_p)$ with $t_i := q'_i$ if $i \in I$ and $t_i := x_i$ if $i \in \{1,\ldots,p\} \setminus I$. Note that
$z \in S \cup X(S)$ and $\delta'^*(C[q']) = \delta'^*(C[z])$ for all $C \in E_{\langle \ell - \max\{d(q'), d(z)\} \rangle}$. Thus
$T(C[q']) = 1$ if and only if $T(C[z]) = 1$ because $\mathcal{A}'$ is consistent with $T$ on
$\mathcal{T}(Sk \cup \Sigma)_{[\ell]}$. Therefore $q' \sim z$, and since $z = C_1[q'_i]$ with $C_1 = z[\bullet]_i \in \sigma_\bullet(S)$
and $q'_i \in S$ for any $i \in I$, we can apply Lemma 7 to learn that $d(q') \le d(z)$.
Also

- for all $i \in I$, $d(t_i) = d(q'_i) \le d(x_i)$ by induction hypothesis, and
- for all $i \in \{1,\ldots,p\} \setminus I$, $t_i = x_i$, thus $d(t_i) = d(x_i)$,

therefore $d(z) = d(\sigma(t_1,\ldots,t_p)) \le d(\sigma(x_1,\ldots,x_p)) = d(x)$. From $d(q') \le
d(z)$ and $d(z) \le d(x)$ we learn $d(q') \le d(x)$.

Let $m = \ell - d(x) + \max\{d(q), d(q')\}$. We prove $q \sim_m q'$ by contradiction.
If $q \not\sim_m q'$ there exists $C \in E_{\langle \ell - d(x) \rangle}$ such that $T(C[q]) \ne T(C[q'])$. Then
$q \sim y$ and $d(q) \le d(y) \le d(x)$, thus $d(C[q]) \le d(C[y]) \le d(C[x]) \le \ell$
and $T(C[q]) = T(C[y])$. Also, $q' \sim z$ and $d(q') \le d(z) \le d(x)$, thus
$d(C[q']) \le d(C[z]) \le d(C[x]) \le \ell$ and $T(C[q']) = T(C[z])$. Thus $T(C[y]) \ne
T(C[z])$. On the other hand, by induction hypothesis, $q_i \sim_{m_i} q'_i$ for all $i \in I$,
where $m_i = \ell - d(x_i) + \max\{d(q_i), d(q'_i)\}$. Let us assume $I = \{i_1,\ldots,i_r\}$,
$C_1 := y[\bullet]_{i_1}$, and $C_{j+1} := C_j[q'_{i_j}][\bullet]_{i_{j+1}}$ for all $1 \le j < r$. Then $C_j \in \sigma_\bullet(S)$
and $C_j[q_{i_j}] \sim_{m_{i_j}} C_j[q'_{i_j}]$ for all $1 \le j \le r$, because the observation table is
consistent. Therefore $T(C''[C_j[q_{i_j}]]) = T(C''[C_j[q'_{i_j}]])$ whenever $1 \le j \le r$
and $C'' \in E_{\langle \ell - d(x_{i_j}) - 1 \rangle}$. Since $d(x) = 1 + \max\{d(x_{i_j}) \mid 1 \le j \le r\}$, we have
$C \in E_{\langle \ell - d(x_{i_j}) - 1 \rangle}$, thus $T(C[C_j[q_{i_j}]]) = T(C[C_j[q'_{i_j}]])$ for all $1 \le j \le r$. Note
that

$$T(C[y]) = T(C[C_1[q_{i_1}]]) = T(C[C_1[q'_{i_1}]]) = T(C[C_2[q_{i_2}]]) = T(C[C_2[q'_{i_2}]]) = \cdots = T(C[C_r[q_{i_r}]]) = T(C[C_r[q'_{i_r}]]) = T(C[z]),$$

which yields a contradiction.

Finally, we prove that $\mathcal{L}(\mathcal{A}(\mathbb{T}))_{[\ell]} = \mathcal{L}(\mathcal{A}')_{[\ell]}$. Let $x \in \mathcal{T}(Sk \cup \Sigma)_{[\ell]}$
and $q, q' \in Q$ such that $\delta^*(x) = q$ and $\delta'^*(x) = \delta'^*(q')$. Then $q \sim_m q'$
where $m = \ell - d(x) + \max\{d(q), d(q')\}$. Since $d(x) \ge \max\{d(q), d(q')\}$ and
$\bullet \in E$, $T(q) = T(q') \in \{0,1\}$. $\mathcal{A}(\mathbb{T})$ is consistent with $T$ on $\mathcal{T}(Sk \cup \Sigma)_{[\ell]}$ and
$\delta^*(q) = q$, thus $q \in Q_f$ if and only if $T(q) = 1$. $\mathcal{A}'$ is also consistent with $T$ on
$\mathcal{T}(Sk \cup \Sigma)_{[\ell]}$, thus $\delta'^*(q') \in Q'_f$ if and only if $T(q') = 1$. Since $T(q) = T(q')$,
we have $q \in Q_f$ if and only if $f(q') \in Q'_f$. Thus $\delta^*(x) \in Q_f$ if and only if
$\delta'^*(x) \in Q'_f$. That is, $x \in \mathcal{L}(\mathcal{A}(\mathbb{T}))_{[\ell]}$ if and only if $x \in \mathcal{L}(\mathcal{A}')_{[\ell]}$. □


Corollary 2. Let $\mathcal{A}$ be the automaton corresponding to a closed and consistent
observation table $(S,E,T,\ell)$ of the skeletons of a CFG $G_U$ of an
unknown language $U$, and $N$ be its number of states. Let $n$ be the number of
states of a minimal DCTA of $K(D(G_U))$ with respect to $\ell$. If $N \ge n$ then
$N = n$ and $\mathcal{A}$ is a minimal DCTA of $K(D(G_U))$ with respect to $\ell$.


Proof. Let $\mathcal{A}'$ be a minimal DCTA of $K(D(G_U))$ with respect to $\ell$. Then
$\mathcal{A}'$ is consistent with $T$ on $\mathcal{T}(Sk \cup \Sigma)_{[\ell]}$ and has $n$ states. Since $n \le N$,
by Theorem 2, $n = N$ and $\mathcal{L}(\mathcal{A})_{[\ell]} = \mathcal{L}(\mathcal{A}')_{[\ell]} = K(D(G_U))_{[\ell]}$. Thus $\mathcal{A}$ is a
minimal DCTA of $K(D(G_U))$ with respect to $\ell$. □


The $LA^\ell$ Algorithm


The algorithm $LA^\ell$ extends the observation table $\mathbb{T} = (S,E,T,\ell)$ whenever
one of the following situations occurs: the table is not consistent, the table
is not closed, or the table is both consistent and closed but the resulting
automaton $\mathcal{A}(\mathbb{T})$ is not a cover tree automaton of $K(D(G_U))$ with respect
to $\ell$.

The pseudocode of the algorithm is shown in Figure 2. Consistency
is checked by searching for $C \in E$ and $C_1 \in \sigma_\bullet(S)$ such that $C[C_1]$ will
$\ell$-distinguish two terms $s_1, s_2 \in S$ not distinguished by any other context
$C' \in E$ with $d_\bullet(C') \le d_\bullet(C[C_1])$. Whenever such a pair of contexts $(C, C_1)$ is
found, $C[C_1]$ is added to $E$. Note that $C[C_1] \in \mathcal{C}(Sk \cup \Sigma)_{\langle \ell-1 \rangle} \cap \mathcal{C}(Sk \cup \Sigma)_{[\ell]}$
because only such contexts can distinguish terms from $S$, and the addition
of $C[C_1]$ to $E$ yields a $\bullet$-prefix closed subset of $\mathcal{C}(Sk \cup \Sigma)_{\langle \ell-1 \rangle} \cap \mathcal{C}(Sk \cup \Sigma)_{[\ell]}$.

The search for such a pair of contexts $(C, C_1)$ is repeated in increasing
order of the hole depth of $C$, until all contexts from $E$ have been processed.
Therefore, any context $C[C_1]$ with $C \in E$ and $C_1 \in \sigma_\bullet(S)$ that was added to
$E$ because of a failed consistency check will be processed itself in the same
for loop.

ask if $(\{S\}, \Sigma, \emptyset, S)$ is a cover CFG of $G_U$ w.r.t. $\ell$
if answer is yes then halt and output the CFG $(\{S\}, \Sigma, \emptyset, S)$
if answer is no with counterexample $t$ then
    set $S := \{s \mid s$ is a subterm of $t$ with depth at least 1$\}$ and $E := \{\bullet\}$
    construct the table $\mathbb{T} = (S, E, T, \ell)$ using structural membership queries
    repeat
        repeat
            /* check consistency */
            for every $C \in E$, in increasing order of $i = d_\bullet(C)$ do
                search for $s_1, s_2 \in S$ with $d(s_1), d(s_2) \le \ell - i - 1$ and $C_1 \in \sigma_\bullet(S)$
                    such that $C[C_1[s_1]], C[C_1[s_2]] \in \mathcal{T}(Sk \cup \Sigma)_{[\ell]}$,
                    $s_1 \sim_k s_2$ where $k = \max\{d(s_1), d(s_2)\} + i + 1$,
                    and $T(C[C_1[s_1]]) \ne T(C[C_1[s_2]])$
                if found then
                    add $C[C_1]$ to $E$
                    extend $T$ to $E[S \cup X(S)_{[\ell]}]$ using structural membership queries
            /* check closedness */
            new_row_added := false
            repeat for every $s \in S$, in increasing order of $d(s)$
                search for $C_1 \in \sigma_\bullet(S)$ such that $C_1[s] \not\sim t$ for all $t \in S_{[d(C_1[s])]}$
                if found then
                    add $C_1[s]$ to $S$
                    extend $T$ to $E[S \cup X(S)_{[\ell]}]$ using structural membership queries
                    new_row_added := true
            until new_row_added = true or all elements of $S$ have been processed
        until new_row_added = false
        /* $\mathbb{T}$ is now closed and consistent */
        make the query whether $G(\mathcal{A}(\mathbb{T}))$ is a cover CFG of $G_U$ w.r.t. $\ell$
        if the reply is no with a counterexample $t$ then
            add to $S$ all subterms of $t$, including $t$, with depth at least 1,
                in the increasing order given by $\prec_T$
            extend $T$ to $E[S \cup X(S)_{[\ell]}]$ using structural membership queries
    until the reply is yes to the query if $G(\mathcal{A}(\mathbb{T}))$ is a cover CFG of $G_U$ w.r.t. $\ell$
    halt and output $G(\mathcal{A}(\mathbb{T}))$.

Figure 2: Algorithm $LA^\ell$

The algorithm checks closedness by searching for $s \in S$ and $C_1 \in \sigma_\bullet(S)$
such that $C_1[s] \not\sim t$ for all $t \in S$ for which $d(t) \le d(C_1[s])$. The search is
performed in increasing order of the depth of $s$. If $s$ and $C_1$ are found, $C_1[s]$
is added to the $S$ component of the observation table, and the algorithm
checks consistency again. Note that adding $C_1[s]$ to $S$ yields a subterm
closed subset of $\mathcal{T}(Sk \cup \Sigma)_{[\ell]}$. Also, closedness checks are performed only on
consistent observation tables.

When the observation table is both consistent and closed, the corresponding
DFTA is constructed and it is checked whether the language
accepted by the constructed automaton coincides, on terms of depth at most
$\ell$, with the set of skeletal descriptions of the unknown context-free
grammar $G_U$ (this is called a structural equivalence query). If this query
fails, a counterexample $t$ from
$\mathcal{L}(\mathcal{A}(\mathbb{T}))_{[\ell]} \,\triangle\, K(D(G_U))_{[\ell]}$ is produced, the component $S$ of the observation
table is expanded to include $t$ and all its subterms with depth at least 1,
and the consistency and closedness checks are performed once more. At the
end of this step, the component $S$ of the observation table is subterm closed,
and $E$ is unchanged, thus $\bullet$-prefix closed.
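
The counterexample-processing step can be sketched in Python as follows. This is only an illustration under stated assumptions: terms are encoded as strings (leaves from $\Sigma$, assumed to have depth 0) or tuples `("sigma", child, ...)` with at least one child, and ordering by depth is used as a stand-in for $\prec_T$, whose exact definition is not reproduced here. The names `depth` and `subterms_to_add` are hypothetical.

```python
from typing import List, Tuple, Union

# Assumed encoding: leaf = str, inner node = ("sigma", child_1, ..., child_m).
Term = Union[str, Tuple]


def depth(term: Term) -> int:
    """Depth of a term, assuming leaves from Sigma have depth 0."""
    if isinstance(term, str):
        return 0
    return 1 + max(depth(child) for child in term[1:])


def subterms_to_add(counterexample: Term) -> List[Term]:
    """Subterms of depth >= 1, without duplicates, shallowest first."""
    found: List[Term] = []

    def visit(term: Term) -> None:
        if isinstance(term, str):
            return  # leaves have depth 0 and are never added to S
        for child in term[1:]:
            visit(child)
        if term not in found:
            found.append(term)

    visit(counterexample)
    # Sorting by depth keeps every prefix of the list subterm closed.
    return sorted(found, key=depth)


t = ("sigma", ("sigma", "a"), ("sigma", ("sigma", "a"), "b"))
for s in subterms_to_add(t):
    print(depth(s), s)
```

Adding the terms in this order means that each newly added term already has all of its non-leaf subterms in $S$, so $S$ stays subterm closed after every single addition.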

Thus, at any time during the execution of algorithm $LA^\ell$, the defining
properties of an observation table are preserved: the component $S$ is a
subterm closed subset of $\mathcal{T}(Sk \cup \Sigma)_{[\ell]}$, and the component $E$ is a $\bullet$-prefix
closed subset of $\mathcal{C}(Sk \cup \Sigma)_{\langle \ell-1 \rangle} \cap \mathcal{C}(Sk \cup \Sigma)_{[\ell]}$.
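
The control flow described in this subsection can be summarised by the following Python sketch. It abstracts every table and teacher operation behind hypothetical methods (`find_inconsistency`, `add_experiment`, `find_unclosed_row`, `add_row`, `build_grammar`, `add_counterexample`, `equivalent`); none of these names come from the paper, and the initial query with the empty grammar is omitted. It is meant to show the nesting of the three kinds of repair, not to be an implementation of the algorithm.

```python
def learn(table, teacher):
    """Nested repair loops: consistency, then closedness, then equivalence.

    `table` and `teacher` are assumed to provide the hypothetical methods
    used below; a method returning None means "nothing found".
    """
    while True:
        while True:
            # Repair consistency until no distinguishing context is found.
            while (context := table.find_inconsistency()) is not None:
                table.add_experiment(context)
            # Closedness is only checked on a consistent table; at most one
            # row is added before consistency is checked again.
            row = table.find_unclosed_row()
            if row is None:
                break
            table.add_row(row)
        # The table is now closed and consistent.
        hypothesis = table.build_grammar()
        counterexample = teacher.equivalent(hypothesis)
        if counterexample is None:
            return hypothesis
        table.add_counterexample(counterexample)
```

Adding a single row and then returning to the consistency check reflects the remark that closedness checks are performed only on consistent observation tables.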


5 Algorithm Analysis 


We notice that the number of states of the DFTA constructed by algorithm
$LA^\ell$ will always increase between two successive structural equivalence
queries. When this number of states reaches the number of states of a
minimal DCTA of $K(D(G_U))$, the constructed DFTA is actually a minimal
DCTA of $K(D(G_U))$ (Corollary 2) and the algorithm terminates.

From now on we assume implicitly that $n$ is the number of states of
a minimal DCTA of $K(D(G_U))$ with respect to $\ell$, and that $\mathbb{T}(t)$ is the
observation table $(S^t, E^t, T, \ell)$ before execution step $t$ of the algorithm. By
Corollary 2, $Q^t$ will always have between 1 and $n$ elements. Note that the
representative of an element $s \in S$ in $Q^t$ is a notion that depends on the
observation table $\mathbb{T}(t)$. Therefore, we will use the notation $r_t(s)$ to refer
to the representative of $s \in S^t$ in the observation table $\mathbb{T}(t)$. With this
notation, $Q^t = \{r_t(s) \mid s \in S^t\}$.

Note that the execution of algorithm $LA^\ell$ is a sequence of steps characterised
by the detection of three kinds of failure: closedness, consistency,
and structural equivalence query. The $t$-th execution step is


1. a failed closedness check when the algorithm finds $C_1 \in \sigma_\bullet(S^t)$ and
$s \in S^t$ such that $C_1[s] \not\sim t$ for all $t \in S^t$ with $d(t) \le d(C_1[s])$,


2. a failed consistency check when the algorithm finds $C \in E^t$ with
$d_\bullet(C) = i$, $s_1, s_2 \in S^t$ with $d(s_1), d(s_2) \le \ell - i - 1$, and $C_1 \in \sigma_\bullet(S^t)$,
such that $C[C_1[s_1]], C[C_1[s_2]] \in \mathcal{T}(Sk \cup \Sigma)_{[\ell]}$, $s_1 \sim_k s_2$ where $k =
\max\{d(s_1), d(s_2)\} + i + 1$, and $T(C[C_1[s_1]]) \ne T(C[C_1[s_2]])$,


3. a failed structural equivalence query when the observation table $\mathbb{T}(t)$
is closed and consistent, and the learning algorithm receives from the
teacher a counterexample $t \in \mathcal{T}(Sk \cup \Sigma)_{[\ell]}$ as answer to the structural
equivalence query with the grammar $G(\mathcal{A}(S^t, E^t, T, \ell))$.


In the following subsections we perform a complexity analysis of the algorithm
by identifying upper bound estimates to the computations due to failed
consistency checks, failed closedness checks, and failed structural equivalence
queries.


5.1 Failed Closedness Checks 


We recall that the $t$-th execution step is a failed closedness check if the
algorithm finds a context $C_1 \in \sigma_\bullet(S^t)$ and a term $s \in S^t$ such that $C_1[s] \not\sim t$
for all $t \in S^t$ with $d(t) \le d(C_1[s])$. We will show that the number of failed
closedness checks performed by algorithm $LA^\ell$ has an upper bound which is
a polynomial in $n$. To prove this fact, we will rely on the following auxiliary
notions:


- For $r, r' \in Q^t$, we define $r \prec^t r'$ if either $d(r) < d(r')$ or $d(r) = d(r')$
and there exists $t' < t$ such that $r \in Q^{t'}$ but $r' \notin Q^{t'}$ (that is, $r$
became a representative in the observation table before $r'$).


- To every set of representatives $Q^t = \{r_1,\ldots,r_m\}$ with $r_1 \prec^t \ldots \prec^t r_m$
we associate the tuple $\mathrm{tpl}(Q^t) := (d_1,\ldots,d_n) \in \{1,\ldots,\ell+1\}^n$ where
$d_i := d(r_i)$ if $1 \le i \le m$, and $d_i := \ell+1$ if $m < i \le n$.


- We consider the following partial order on $\mathbb{N}^n$: $(x_1,\ldots,x_n) < (x'_1,\ldots,x'_n)$
iff there exists $i \in \{1,\ldots,n\}$ such that $x_i < x'_i$ and $x_j \le x'_j$ for all
$1 \le j \le n$.

- We denote by $st_t(i)$ the $i$-th component of $Q^t$ in the order given by
$\prec^t$.

Lemma 9. Suppose $s$ has been introduced in $S^{t+1}$ as a result of a failed
closedness check. There exists $p \in Pos(s)$ such that $\|p\| = d(s)$ and for every
prefix $p'$ of $p$ different from $p$, $d(r_{t+1}(s|_{p'})) = d(s|_{p'})$.


Proof. We prove by induction on $i$ that for every $i \in \{0,\ldots,d(s)-1\}$ there
exists a sequence of positions $p_0 < p_1 < \ldots < p_i$ from $Pos(s)$ such that, for
all $0 \le j \le i$, the following statements hold:


(L1): $\|p_j\| = j$ and $d(s|_{p_j}) = d(s) - j$,


(L2): $d(r_{t+1}(s|_{p_j})) = d(s|_{p_j})$.


For $i = 0$ we reason as follows: since $s$ has been introduced in $S^{t+1}$ as a
result of a failed closedness check, $s \not\sim t$ for all $t \in S^t$ with $d(t) \le d(s)$. Then
$s$ becomes a new element of the set $Q^{t+1}$, $r_{t+1}(s) = s$ and, if we choose
$p_0 = \epsilon$, the sequence of positions $p_0$ fulfils requirements (L1) and (L2).

For the inductive step, assume the condition holds for $0 \le i < d(s) - 1$,
that is, there exists a sequence of positions $p_0 < \ldots < p_i$ from $Pos(s)$
which fulfils requirements (L1) and (L2) for all $0 \le j \le i$. We show
that this sequence can be extended with a position $p_{i+1} \in Pos(s)$ such
that requirements (L1) and (L2) hold for $j = i+1$. Let $x := s|_{p_i}$. Then
$d(r_{t+1}(x)) = d(x)$ and, since $d(x) = d(s) - i$ and $i < d(s) - 1$, we have
$d(x) > 1$. Therefore, we can write $x = \sigma(x_1,\ldots,x_m)$, and the set $I := \{j \in
\{1,\ldots,m\} \mid d(x_j) = d(x) - 1\}$ is nonempty.

Assume, by contradiction, that no such position $p_{i+1}$ exists. Let $q_j :=
r_{t+1}(x_j)$ for all $j \in I$, and $y = \sigma(y_1,\ldots,y_m)$ where $y_j := q_j$ if $j \in I$ and
$y_j := x_j$ otherwise. Then $y \in S^{t+1} \cup X(S^{t+1})$, $q_j \sim x_j$ and $d(q_j) < d(x_j)$
for all $j \in I$. It follows that $d(y) < d(x)$, and $x \sim y$ in $\mathbb{T}(t+1)$, by Lemma
2. We distinguish two cases:


1. $y \in S^{t+1}$. Then $d(r_{t+1}(x)) \le d(y) < d(x)$, which is a contradiction.
2. $y \in X(S^{t+1})$. Then $y \not\sim z$ for all $z \in S^{t+1}$ with $d(z) \le d(y)$, because
if there exists $z \in S^{t+1}$ with $d(z) \le d(y)$ such that $y \sim z$,
then $x \sim z$ (by Lemma 1) and $d(z) < d(x)$, which contradicts
$d(r_{t+1}(x)) = d(x)$.
As $d(y) < d(x) = d(s|_{p_i}) \le d(s)$, $y$ would be introduced in $S^{t+1}$ instead
of $s$ as the result of a failed closedness check. This also provides a
contradiction.

Thus, there exists a sequence of positions $p_0 < \ldots < p_{d(s)-1}$ from $Pos(s)$
such that requirements (L1) and (L2) hold for all $j \in \{0,1,\ldots,d(s)-1\}$. It
follows that the statement of this lemma holds for $p = p_{d(s)-1}$. □


Corollary 3. Whenever the $t$-th execution step is a failed closedness check,
the term introduced in $S^{t+1}$ is in $Q^{t+1} \setminus Q^t$ and its depth is at most $j$,
where $j$ is the position in $Q^{t+1}$ of the newly introduced element according to
the ordering $\prec^{t+1}$.


Proof. If $s$ is introduced in $S^{t+1}$ by a failed closedness check then $s \not\sim t$ for
all $t \in S^t$ with $d(t) \le d(s)$. Therefore, $s \in Q^{t+1} \setminus Q^t$. Furthermore, from
the proof of the previous lemma we know there exists a sequence

$$p_0 < p_1 < \cdots < p_{d(s)-1}$$

of positions from $Pos(s)$ with $d(r_{t+1}(s|_{p_j})) = d(s) - j$ for all $0 \le j < d(s)$. Since
$r_{t+1}(s|_{p_j}) \in Q^{t+1}$ for all $0 \le j < d(s)$ and $r_{t+1}(s) = s$, we have the chain of
$d(s)$ elements

$$r_{t+1}(s|_{p_{d(s)-1}}) \prec^{t+1} \cdots \prec^{t+1} r_{t+1}(s|_{p_1}) \prec^{t+1} r_{t+1}(s|_{p_0}) = s,$$

and we conclude that, if $s = st_{t+1}(j)$, then $d(s) \le j$. □


Corollary 4. $d(s) \le n$ for all $s \in S^t$ which were introduced in the table by
a failed closedness check.


Proof. $d(s) \le j$ by Cor. 3, and $j \le n$ because $|Q^t| \le n$ for all $t$. Thus
$d(s) \le n$. □


Lemma 10. Let $j$ be the position of the element introduced in $Q^{t+1}$ by
a failed closedness check. Then $\mathrm{tpl}(Q^{t+1}) < \mathrm{tpl}(Q^t)$ and $d(st_{t+1}(j)) <
d(st_t(j))$.


Proof. Let $r$ be the representative newly introduced in $Q^{t+1}$ at position $j$
(that is, $r = st_{t+1}(j)$), $k = d(r)$, and $i' := \max\{i \mid d(st_t(i)) \le k\}$. Then
$j = i' + 1$ and we distinguish two situations.


1. If $r$ replaces a representative with depth $k'$ at position $j'$ in $Q^t$ then
$k < k'$, $i' < j'$ and $j = i' + 1$. Thus $j \le j'$ and
- if $1 \le i < j$ then $st_t(i) = st_{t+1}(i)$,
- $d(st_t(j)) > k = d(st_{t+1}(j))$,
- if $j < i \le j'$ then $d(st_t(i)) \ge d(st_t(i-1)) = d(st_{t+1}(i))$,
- if $j' < i \le n$ then $st_t(i) = st_{t+1}(i)$.
Hence $\mathrm{tpl}(Q^{t+1}) < \mathrm{tpl}(Q^t)$.


2. Otherwise, $r$ is newly introduced at position $j = i' + 1$ in $Q^{t+1}$ and all
elements of $Q^t$ are preserved in $Q^{t+1}$. If $|Q^t| = m$ then
- if $1 \le i < j$ then $st_t(i) = st_{t+1}(i)$,
- $d(st_t(j)) > k = d(st_{t+1}(j))$,
- if $j < i \le m$ then $d(st_t(i)) \ge d(st_t(i-1)) = d(st_{t+1}(i))$,
- $d(st_t(m+1)) = \ell + 1 > d(st_{t+1}(m+1))$,


which, again, implies $\mathrm{tpl}(Q^{t+1}) < \mathrm{tpl}(Q^t)$. □
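
To make the tuple comparison concrete, here is a small Python sketch. It assumes the partial order on tuples is the componentwise one (every component less than or equal, at least one strictly smaller), and it represents $Q^t$ only by the depths of its representatives; the function names are hypothetical.

```python
from typing import List, Tuple


def depth_tuple(depths: List[int], n: int, ell: int) -> Tuple[int, ...]:
    """Sorted depths of the current representatives, padded with ell + 1."""
    ordered = sorted(depths)  # the ordering of representatives is by depth first
    return tuple(ordered + [ell + 1] * (n - len(ordered)))


def strictly_below(x: Tuple[int, ...], y: Tuple[int, ...]) -> bool:
    """x < y: componentwise <=, with a strict inequality somewhere."""
    return (
        len(x) == len(y)
        and all(a <= b for a, b in zip(x, y))
        and any(a < b for a, b in zip(x, y))
    )


n, ell = 3, 4
before = depth_tuple([2], n, ell)      # one representative of depth 2
after = depth_tuple([2, 1], n, ell)    # a new representative of depth 1 appears
print(before, after, strictly_below(after, before))
```

In the example the new representative of depth 1 takes position 1 and shifts the older one to position 2, so every component of the tuple either stays the same or decreases.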


Theorem 3. The number of failed closedness checks performed during the
entire run of $LA^\ell$ is at most $n(n+1)/2$.


Proof. By Corollary 4, $\mathrm{tpl}(Q^t)$ is always a tuple of the form $(d_1,\ldots,d_n)$
with $d_i \in \{1,2,\ldots,n\} \cup \{\ell+1\}$. Also, by Lemma 10 and Corollary 3,
$\mathrm{tpl}(Q^t) > \mathrm{tpl}(Q^{t+1})$
and $d(st_{t+1}(j)) \le j$ whenever $st_{t+1}(j)$ is the state introduced in $Q^{t+1}$ by
a failed closedness check. It is also easy to see that $\mathrm{tpl}(Q^t) \ge \mathrm{tpl}(Q^{t+1})$
always holds. Since $\mathrm{tpl}(Q^0) = (\ell+1,\ldots,\ell+1)$, and every $j$-th component
of the tuples $\mathrm{tpl}(Q^t)$ can be changed by a failed closedness check at most
$j$ times (if the first change is to value $j$, and the others are decrements by
1 down to the minimum value 1), we conclude that the maximum number of
failed closedness checks in any sequence

$$\mathrm{tpl}(Q^0) \ge \mathrm{tpl}(Q^1) \ge \ldots \ge \mathrm{tpl}(Q^t)$$

is at most $\sum_{j=1}^{n} j = n(n+1)/2$. □
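
The counting argument can be replayed with a short Python sketch. It simulates the worst case described in the proof, in which component $j$ is first set to $j$ and then lowered by 1 at a time down to 1; the function name is hypothetical and the simulation is only an illustration of the bound.

```python
def worst_case_closedness_failures(n: int) -> int:
    """Count component changes when component j goes j, j-1, ..., 1."""
    failures = 0
    for j in range(1, n + 1):
        value = j          # first change: from ell + 1 to at most j
        failures += 1
        while value > 1:   # every later change lowers the value by at least 1
            value -= 1
            failures += 1
    return failures


for n in (1, 2, 5, 10):
    print(n, worst_case_closedness_failures(n), n * (n + 1) // 2)
```

For every $n$ the simulated count agrees with the closed form $n(n+1)/2$.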


5.2 Failed Consistency Checks 


The $t$-th execution step is a failed consistency check if the algorithm finds
$C \in E^t$ with $d_\bullet(C) = i$, $s_1, s_2 \in S^t$ with $d(s_1), d(s_2) \le \ell - i - 1$, and
$C_1 \in \sigma_\bullet(S^t)$, such that $C[C_1[s_1]], C[C_1[s_2]] \in \mathcal{T}(Sk \cup \Sigma)_{[\ell]}$, $s_1 \sim_k s_2$ where
$k = \max\{d(s_1), d(s_2)\} + i + 1$, and $T(C[C_1[s_1]]) \ne T(C[C_1[s_2]])$. In this
case, the context $C[C_1]$ is newly introduced in the component $E^{t+1}$ of the
observation table $\mathbb{T}(t+1)$.

We will show that the number of failed consistency checks performed
by the learning algorithm $LA^\ell$ has an upper bound which is a polynomial in
$n$. To prove this fact, we rely on the following auxiliary notions:


- For $C, C' \in E^t$, we define $C \prec^t_C C'$ if either $d_\bullet(C) < d_\bullet(C')$ or $d_\bullet(C) =
d_\bullet(C')$ and there exists $t' < t$ such that $C \in E^{t'}$ but $C' \notin E^{t'}$ (that
is, $C$ became an experiment in the observation table before $C'$).


- We define $\delta_t(s_1, s_2) := \min_{\prec^t_C}\{C \in E^t \mid C$ $\ell$-distinguishes $s_1$ and $s_2\}$
for every $s_1, s_2 \in S^t$ such that $s_1 \not\sim s_2$.


- A nonempty subset $U$ of $E^t$ induces a partition of a subset $R$ of $S^t$ into
equivalence classes $Q_1,\ldots,Q_m$ if the following conditions are satisfied:


1. $\bigcup_{i=1}^{m} Q_i = R$ and $Q_i \cap Q_j = \emptyset$ whenever $1 \le i \ne j \le m$,


2. Whenever $1 \le i \ne j \le m$, $s_1 \in Q_i$, and $s_2 \in Q_j$, there exists
$C \in U$ that $\ell$-distinguishes $s_1$ and $s_2$.


3. Whenever $s_1, s_2 \in Q_j$ for some $1 \le j \le m$, there is no $C \in U$
that $\ell$-distinguishes $s_1$ and $s_2$.
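
The three conditions can be checked mechanically, as the following Python sketch illustrates. It treats "$C$ $\ell$-distinguishes $s_1$ and $s_2$" as an opaque predicate supplied by the caller, so the depth restrictions hidden in that notion are assumed to be handled inside the predicate. The function name, the toy table, and the labels `C0`, `C1`, `s1`, `s2`, `s3` are hypothetical.

```python
from itertools import combinations, product
from typing import Callable, Hashable, Iterable, Sequence


def induces_partition(
    contexts: Iterable[Hashable],
    classes: Sequence[Sequence[Hashable]],
    universe: Iterable[Hashable],
    distinguishes: Callable[[Hashable, Hashable, Hashable], bool],
) -> bool:
    """Check conditions 1-3 for the given contexts, classes and set R."""
    contexts = list(contexts)
    members = [s for cls in classes for s in cls]
    # Condition 1: the classes are pairwise disjoint and cover the universe.
    if len(members) != len(set(members)) or set(members) != set(universe):
        return False

    def separated(a: Hashable, b: Hashable) -> bool:
        return any(distinguishes(c, a, b) for c in contexts)

    # Condition 3: no context separates two members of the same class.
    for cls in classes:
        if any(separated(a, b) for a, b in combinations(cls, 2)):
            return False
    # Condition 2: members of different classes are always separated.
    for left, right in combinations(classes, 2):
        if not all(separated(a, b) for a, b in product(left, right)):
            return False
    return True


# Hypothetical toy table of membership answers T(C[s]).
table = {
    ("C0", "s1"): 1, ("C0", "s2"): 1, ("C0", "s3"): 0,
    ("C1", "s1"): 1, ("C1", "s2"): 0, ("C1", "s3"): 0,
}


def differs(c, a, b):
    return table[(c, a)] != table[(c, b)]


print(induces_partition(["C0"], [["s1", "s2"], ["s3"]], ["s1", "s2", "s3"], differs))
print(induces_partition(["C0", "C1"], [["s1"], ["s2"], ["s3"]], ["s1", "s2", "s3"], differs))
```

In the toy data, adding the second context refines the partition from two classes to three, which is the kind of growth the counting argument in this subsection relies on.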


Let $\mathcal{E}^t := \{\delta_t(s_1, s_2) \mid s_1, s_2 \in S^t, s_1 \not\sim s_2\}$. Since $\sim$ is not an equivalence,
not every subset of $E^t$ induces a partition of $S^t$ into equivalence classes.
However, the next theorem shows that $\mathcal{E}^t$ induces a partition of $Q^t$ into at
least $|\mathcal{E}^t|$ classes.


Theorem 4. If $\mathcal{E}^t = \{C_1,\ldots,C_k\}$ with $C_1 \prec^t_C \ldots \prec^t_C C_k$ then, for every
$1 \le i \le k$, $\{C_1,\ldots,C_i\}$ induces a partition of $Q^t$ into at least $i$ classes.


Proof. First, we prove by induction on $i$, $1 \le i \le k$, that $\{C_1,\ldots,C_i\}$
induces a partition of $Q^t$. In the base case, $i = 1$, $\{C_1,\ldots,C_i\} = \{C_1\} = \{\bullet\}$,
and the statement of the theorem is obviously true. In the induction step,
we assume that $\{C_1,\ldots,C_i\}$ induces a partition $Q_1,\ldots,Q_m$ of $Q^t$. Let
$M_p := \{Q_i \mid |Q_i| > 1\}$, and $M := \bigcup_{Q_i \in M_p} Q_i$. As all pairs of elements in $M$
are $\ell$-distinguished by some element of $C_{i+1},\ldots,C_k$ and $d_\bullet(C_{i+1}) \le d_\bullet(C_j)$
for all $i < j \le k$, the depth of any term contained in $M$ is at most $\ell - d_\bullet(C_{i+1})$.
Thus $T(C_{i+1}[t]) \in \{0,1\}$ for all $t \in M$, and therefore $\{C_1,\ldots,C_i,C_{i+1}\}$
induces a partition of $Q^t$.

Let $C_{j_1},\ldots,C_{j_k}$ be the order in which the contexts were added to $E^t$ by
failed consistency checks. Because every $C_{j_{p+1}}$ $\ell$-distinguishes some elements
of $Q^t$ that were not $\ell$-distinguished by any of $C_{j_q}$ with $1 \le q \le p$, we conclude
that $\mathcal{E}^t$ induces a partition of $Q^t$ into at least $k$ classes. □


Corollary 5. For any $t$, $\mathcal{E}^t$ has at most $n$ elements.


Proof. Let $k$ be the number of elements of $\mathcal{E}^t$, and $m$ be the number of
classes in the partition of $Q^t$ induced by $\mathcal{E}^t$. By Theorem 4, $k \le m$. Since
$m \le n$, we conclude that $k \le n$. □


We will compute an upper bound on the number of failed consistency
checks by examining the evolution of $\mathcal{E}^t$ during the execution of $LA^\ell$. Initially,
$\mathcal{E}^0 = \{\bullet\}$.


Lemma 11. At any time during the execution of the algorithm, if $Q^t$ has
$i \ge 2$ elements, then the hole depth of any context in $E^t$ is less than or equal
to $i - 2$.


Proof. The proof is by induction on the execution step $t$ of the algorithm.

In the base case, assume $Q^0$ has $i = 2$ elements. Then $E^0 = \{\bullet\}$ and
$d_\bullet(\bullet) = 0 = i - 2$. In the induction case, we assume that the result holds at
some step $t$ in the execution of the algorithm, and we prove that the result
holds at the next step $t + 1$.

If step $t$ is a failed closedness check or a failed structural equivalence
query, then $E^{t+1} = E^t$, and $Q^{t+1}$ has at least the same number of elements
as $Q^t$. Therefore, the result will hold at step $t + 1$.

Otherwise, the execution step $t$ is a failed consistency check. Let
$s_1, s_2 \in S^t$, $C \in E^t$, and $C_1 \in \sigma_\bullet(S^t)$ be the values for which this failed
consistency check is performed. Then $E^{t+1} = E^t \cup \{C[C_1]\}$. We distinguish
two cases:


1. $s_1$ and $s_2$ are $\ell$-distinguished by some $C' \in E^t$, but $d_\bullet(C') > d_\bullet(C[C_1])$.
Then $d_\bullet(C'') \le \max\{d_\bullet(C') \mid C' \in E^t\}$ for all $C'' \in E^{t+1}$. Since
$\max\{d_\bullet(C') \mid C' \in E^t\} \le i - 2$ by induction hypothesis, and $i =
|Q^t| \le |Q^{t+1}|$, we learn that $d_\bullet(C'') \le |Q^{t+1}| - 2$ for all $C'' \in E^{t+1}$.


2. $s_1$ and $s_2$ are not $\ell$-distinguished by any element of $E^t$. If $d_\bullet(C[C_1]) \le
\max\{d_\bullet(C') \mid C' \in E^t\}$, the result will hold at step $t + 1$. Otherwise,
by induction hypothesis $d_\bullet(C) \le i - 2$ and thus $d_\bullet(C[C_1]) \le i - 1$. Let
$R := Q^t \cup \{s_1, s_2\}$. Since $s_1 \sim_\ell s_2$ at step $t$, at least one of $s_1$ and $s_2$ is
not contained in $Q^t$, thus $R$ will have at least $|Q^t| + 1 = i + 1$ elements.
As $C[C_1]$ $\ell$-distinguishes $s_1$ and $s_2$ and $d_\bullet(C[C_1]) > \max\{d_\bullet(C') \mid C' \in
E^t\}$, $d(C'[s_1]) \le \ell$ and $d(C'[s_2]) \le \ell$ for every $C' \in E^t$. Thus, both
$E^t$ and $E^{t+1}$ will induce a partition of $R$. As $s_1 \sim_\ell s_2$ at step $t$,
but $s_1$ and $s_2$ are $\ell$-distinguished by $C[C_1]$ at step $t + 1$, $E^{t+1}$ will
partition $R$ into at least $|Q^t| + 1$ classes. Thus, $|Q^{t+1}| \ge i + 1$. Hence
$d_\bullet(C'') \le i - 1 = (i + 1) - 2 \le |Q^{t+1}| - 2$ for all $C'' \in E^{t+1}$. □


Let $\mathcal{E}^t = \{C^t_1,\ldots,C^t_k\}$ before some execution step $t$ of the algorithm
$LA^\ell$, where $C^t_1 \prec^t_C \ldots \prec^t_C C^t_k$. Then $k \le n$ by Cor. 5. We associate to every
such $\mathcal{E}^t$ the $n$-tuple $\mathrm{tpl}(\mathcal{E}^t) = (y_1,\ldots,y_n) \in \{0,1,\ldots,n-1\}^n$, where, for
every $1 \le j \le n$, $y_j$ is defined as follows:

- If $Q^t$ has at least $j + 1$ elements then, if $i$ is the minimum integer such
that $\{C^t_1,\ldots,C^t_i\}$ partitions $Q^t$ into at least $j + 1$ classes then $y_j = d_\bullet(C^t_i)$.
Since every $\{C^t_1,\ldots,C^t_i\}$ partitions $Q^t$ into at least $i$ classes (by Theorem
4) and we assume that $\mathcal{E}^t = \{C^t_1,\ldots,C^t_k\}$ partitions $Q^t$ into $|Q^t| \ge j + 1$
classes, we conclude that such an $i$ exists.

- otherwise $y_j = n - 1$.


For $1 \le j \le n$ we denote the $j$-th component of $\mathrm{tpl}(\mathcal{E}^t)$ by $dh_t(j)$.
Note that, for all $1 \le i \le k$, $d_\bullet(C^t_i) \le |Q^t| - 2$ by Lemma 11, and $|Q^t| \le n$,
hence $d_\bullet(C^t_i) \le n - 2$. Therefore, we can always distinguish the components
$y_j$ of $\mathrm{tpl}(\mathcal{E}^t)$ that correspond to the defining case (1) from those in case (2).


Lemma 12. $dh_t(j) \le j - 1$ whenever $2 \le j \le n$ and $dh_t(j) \ne n - 1$.


Proof. Suppose $t'$ is the first execution step when $dh_{t'}(j) \ne n - 1$. This
means that $t'$ is the first execution step from where on we distinguish at least
$j + 1$ representatives in the observation table. Therefore, at the previous step
$t' - 1$, $|Q^{t'-1}| \le j$ and so, by Lemma 11, $d_\bullet(C) \le j - 2$ for all $C \in E^{t'-1}$.
Thus, $d_\bullet(C') \le j - 1$ for all $C' \in E^{t'}$, and in particular $dh_{t'}(j) \le j - 1$.
Since it is obvious that $dh_t(j) \le dh_{t'}(j)$ whenever $t \ge t'$, we conclude that
$dh_t(j) \le j - 1$ whenever $2 \le j \le n$ and $dh_t(j) \ne n - 1$. □


Theorem 5. If $Q^t$ has at least 2 elements then the number of failed consistency
checks over the entire run of $LA^\ell$ is at most $n(n-1)/2$.


Proof. It is easy to see that $\mathrm{tpl}(\mathcal{E}^t) \ge \mathrm{tpl}(\mathcal{E}^{t+1})$ holds for every execution
step $t$. Moreover, if the $t$-th execution step is a failed consistency check then
a context $C$ is newly added to $\mathcal{E}^t$ in order to produce $\mathcal{E}^{t+1}$. The context
$C$ will $\ell$-distinguish two elements $s_1, s_2 \in S^t$ that were not $\ell$-distinguished
before or had been $\ell$-distinguished by some $C' \in E^t$ with $d_\bullet(C') > d_\bullet(C)$.
Since $d(r_{t+1}(s_1)) \le d(s_1)$ and $d(r_{t+1}(s_2)) \le d(s_2)$, $C$ will $\ell$-distinguish two
elements of $S^t$ that were not $\ell$-distinguished before or were $\ell$-distinguished
by a context with bigger hole depth. Therefore $\mathrm{tpl}(\mathcal{E}^{t+1}) < \mathrm{tpl}(\mathcal{E}^t)$ if $t$ is
a failed consistency check.

Note that $\mathrm{tpl}(\mathcal{E}^0) = (0, n-1, \ldots, n-1)$ and the minimum possible
value of $\mathrm{tpl}(\mathcal{E}^t)$ is $(0, 1, \ldots, 1)$. Also, by Lemma 12, $dh_t(j) \le j - 1$ whenever
$dh_t(j) \ne n - 1$ for $2 \le j \le n$. Therefore, any run of the algorithm performs
at most $\sum_{j=2}^{n} (j - 1) = n(n-1)/2$ failed consistency checks. □


5.3 Failed Structural Equivalence Queries 


Every failed structural equivalence query yields a counterexample which
increases the number of representatives in $Q^t$. Thus


Theorem 6. The number of failed structural equivalence queries is at most
$n$.


Proof. Algorithm $LA^\ell$ performs a failed structural equivalence query when
the observation table $\mathbb{T}(t)$ is closed, consistent, and has fewer than $n$ states
(by Corollary 2). Suppose the algorithm performed a failed
structural equivalence query for $\mathcal{A}(\mathbb{T}(t))$ which rendered the counterexample
$t$. After extending the $S$-component of observation table $\mathbb{T}(t)$ with all
subterms of $t$ that were not yet there, the algorithm constructs a new
observation table $\mathbb{T}(t')$ which is closed and consistent. Since $t \in S^{t'}$, $\bullet \in E^{t'}$,
and $T(\bullet[t])$ in the table $\mathbb{T}(t)$ differs from $T(\bullet[t])$ in the table $\mathbb{T}(t')$, the
automata $\mathcal{A}(\mathbb{T}(t))$ and $\mathcal{A}(\mathbb{T}(t'))$ are not equivalent with respect to $\ell$ (that is,
$\mathcal{L}(\mathcal{A}(\mathbb{T}(t)))_{[\ell]} \ne \mathcal{L}(\mathcal{A}(\mathbb{T}(t')))_{[\ell]}$). Therefore, by Theorem 2, the automaton
$\mathcal{A}(\mathbb{T}(t'))$ must have at least one more state than $\mathcal{A}(\mathbb{T}(t))$. Since the number
of states is increased by every failed structural equivalence query and cannot
exceed $n$, the number of failed structural equivalence queries performed
by algorithm $LA^\ell$ is at most $n$. □
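
Taken together, the three bounds of this section limit the total number of failure steps. The Python sketch below simply adds them up; the function name and the dictionary keys are hypothetical, and the total is a direct arithmetic consequence of Theorems 3, 5 and 6 rather than a statement taken from the paper.

```python
from typing import Dict


def failure_bounds(n: int) -> Dict[str, int]:
    """Upper bounds on each kind of failed step for a target with n states."""
    closedness = n * (n + 1) // 2    # Theorem 3
    consistency = n * (n - 1) // 2   # Theorem 5
    equivalence = n                  # Theorem 6
    return {
        "closedness": closedness,
        "consistency": consistency,
        "equivalence": equivalence,
        "total": closedness + consistency + equivalence,  # equals n*n + n
    }


print(failure_bounds(5))
```

Since $n(n+1)/2 + n(n-1)/2 = n^2$, the total number of failure steps is at most $n^2 + n$.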


5.4 Space and Time Complexity 


We are ready now to express the space and time complexity of LA in terms 
of the following parameters: 

- n = the number of states of a minimal DFCA for the language of 
structural descriptions of the unknown grammar with respect to @, 

- m = the maximum size of a counterexample returned by a failed 
structural equivalence query, 

- p = the cardinality of the alphabet © of terminal symbols, and 
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- d = the maximum rank (or arity) of the symbol o € Sk. 


Space complexity. The number of elements in $S_t$ is initially 0 (i.e.,
$|S_0| = 0$) and is increased either by a failed closedness check or by a
failed structural equivalence query. By Theorem 3, the number of failed
closedness checks is at most $n(n+1)/2$, and each of them adds one element
to $S_t$. By Theorem 6, the number of failed structural equivalence queries
is at most $n$. A failed structural equivalence query which produces a
counterexample $t$ with $sz(t) \le m$ adds at most $m$ terms to $S_t$. Thus
$|S_t| \le n(n+1)/2 + nm = O(mn + n^2)$ and
$|S_t \cup \Sigma| = O(mn + n^2 + p)$, therefore
$|\sigma_\bullet(S_t)| \le \sum_{j=0}^{d-1} (j+1)\,|S_t \cup \Sigma|^j = O(d(mn + n^2 + p)^{d-1})$
and $|X(S_t)| \le \sum_{j=1}^{d} |S_t \cup \Sigma|^j = O((mn + n^2 + p)^d)$.
Thus $S_t \cup X(S_t)_{[\ell]}$ has $O((mn + n^2 + p)^d)$ elements. By
Theorem 5, there may be at most $n(n-1)/2$ failed consistency checks, and
each of them adds a context to $E_t$. We conclude that both
$S_t \cup X(S_t)$ and $E_t$ have a polynomial number of elements.
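
The two counting bounds above can be checked by brute force on a small
instance. The following Python sketch is only an illustration under stated
assumptions: `pool` stands in for $S_t \cup \Sigma$, a term or context is
modelled as the tuple of arguments of the single skeletal symbol, and all
function names are hypothetical (they do not come from the paper).

```python
from itertools import product

HOLE = "•"  # marks the unique hole of a context (assumed notation)

def elementary_contexts(pool, d):
    """Enumerate argument tuples (u_1, ..., HOLE, ..., u_k) with 1 <= k <= d,
    where every non-hole argument is drawn from `pool`."""
    out = set()
    for k in range(1, d + 1):
        for pos in range(k):
            for rest in product(pool, repeat=k - 1):
                out.add(rest[:pos] + (HOLE,) + rest[pos:])
    return out

def one_step_terms(pool, d):
    """Enumerate argument tuples (u_1, ..., u_k) with 1 <= k <= d over `pool`."""
    return {args for k in range(1, d + 1) for args in product(pool, repeat=k)}

def context_bound(size, d):
    """sum_{j=0}^{d-1} (j+1) * size^j, the bound on the number of contexts."""
    return sum((j + 1) * size**j for j in range(d))

def extension_bound(size, d):
    """sum_{j=1}^{d} size^j, the bound on the number of one-step extensions."""
    return sum(size**j for j in range(1, d + 1))

pool = ("a", "b", "c")
print(len(elementary_contexts(pool, 3)), context_bound(3, 3))  # 34 34
print(len(one_step_terms(pool, 3)), extension_bound(3, 3))     # 39 39
```

The second count includes terms that may already belong to the S-component,
so it is an upper bound on the size of the extension set rather than its
exact size.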

Next, we show that both these sets can be encoded to occupy polynomial
space. $S_t \cup X(S_t)$ is subterm closed, therefore we can identify its
elements with the nodes of a forest of trees. Because the size of
$S_t \cup X(S_t)$ is polynomial, we can represent this forest of trees in
polynomial space. This also implies that the elements of
$\sigma_\bullet(S_t)$ can be represented in polynomial space.
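
One possible way to realise such a forest is to store every distinct term
exactly once and let each node refer to its children by identifier. The
Python sketch below is an assumption about a suitable data structure, not
the encoding used by the authors; the class name `TermForest` and the term
encoding (a leaf is a string, an inner node is the tuple of its children)
are hypothetical.

```python
class TermForest:
    """Store each distinct term once. A term is either a leaf symbol (str)
    or a tuple of subterms standing for sigma(child_1, ..., child_k)."""

    def __init__(self):
        self.ids = {}    # node key -> node id
        self.nodes = []  # node id -> leaf symbol or tuple of child ids

    def add(self, term):
        """Insert `term` together with all its subterms; return its id."""
        if isinstance(term, str):
            key = term
        else:
            # Children are inserted first, so the set stays subterm closed.
            key = tuple(self.add(child) for child in term)
        if key not in self.ids:
            self.ids[key] = len(self.nodes)
            self.nodes.append(key)
        return self.ids[key]


forest = TermForest()
root = forest.add((("a", "b"), ("a", "b")))  # sigma(sigma(a,b), sigma(a,b))
print(root, len(forest.nodes))               # 3 4
```

Each node stores at most d child identifiers, so under these assumptions the
whole structure takes space proportional to d times the number of stored
terms.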

$E_t$ is $\bullet$-prefix closed with respect to $S_t$ and
$d_\bullet(C) < n - 1$ for all $C \in E_t$ (by Lemma 12). This implies that
every $C \in E_t$ is a composition of at most $n - 2$ elements from
$\sigma_\bullet(S_t)$, and therefore it can be represented in polynomial
space. Since $E_t$ has a polynomial number of elements, we conclude
that the space complexity of $E_t$ is polynomial.
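
To make the composition argument concrete, a context can be stored as the
list of its elementary steps, outermost first. The Python sketch below
assumes that $d_\bullet(C)$ denotes the depth of the hole in $C$ and reuses
a tuple encoding of terms; the helper names `plug`, `compose` and
`hole_depth` are hypothetical.

```python
HOLE = "•"  # marks the unique hole of a context (assumed notation)

def has_hole(t):
    """True if the term or context `t` contains the hole."""
    return t == HOLE or (isinstance(t, tuple) and any(has_hole(c) for c in t))

def plug(context, term):
    """Replace the unique hole of `context` by `term`."""
    if context == HOLE:
        return term
    return tuple(plug(c, term) if has_hole(c) else c for c in context)

def compose(steps):
    """Fold a list of elementary contexts (outermost first) into one context."""
    context = HOLE
    for step in reversed(steps):
        context = plug(step, context)
    return context

def hole_depth(context):
    """Depth of the hole, i.e. the number of elementary steps composed."""
    if context == HOLE:
        return 0
    return 1 + next(hole_depth(c) for c in context if has_hole(c))


C = compose([(HOLE, "a"), ("b", HOLE)])
print(C)             # (('b', '•'), 'a')
print(hole_depth(C)) # 2
print(plug(C, "c"))  # (('b', 'c'), 'a')
```

Each elementary step mentions at most d - 1 sibling terms, so a context
whose hole has depth k needs only k such steps in this representation.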

We conclude that the space complexity of the observation table
$T(t) = (S_t, E_t, T, \ell)$ is polynomial.


Time complexity. We examine the time complexity of the algorithm by 
looking at the time needed to perform each kind of operation. 

Since the consistency checks of the observation table are performed in
a for loop which checks the result produced by $s_1 \sim_k s_2$ (where
$s_1, s_2 \in S_t$) in increasing order of $k$, the result produced by
$s_1 \sim_k s_2$ can be reused in checking $s_1 \sim_{k+1} s_2$, and so the
corresponding elements in the rows of $s_1$ and $s_2$ are compared only
once. Thus, the total time needed to check if the observation table is
consistent involves at most
$(|S_t| \cdot (|S_t| - 1)/2) \cdot |E_t| \cdot (1 + |\sigma_\bullet(S_t)|)$
comparisons. As $\sigma_\bullet(S_t)$ has $O(d(mn + n^2 + p)^{d-1})$
elements, a consistency check of the table takes
$O((mn + n^2)^2 n^2 d(mn + n^2 + p)^{d-1}) = O(n^2 d(mn + n^2 + p)^{d+1})$
time. As there are at most $(n(n+1)/2 + 1)(n+1) = O(n^3)$ consistency
checks, the total time needed to check if the table is consistent is
$O(n^5 d(mn + n^2 + p)^{d+1})$.

Checking if the observation table is closed takes at most
$|S_t|^2 \cdot |\sigma_\bullet(S_t)| \cdot |E_t|$ time, which is
$O((mn + n^2)^2 d(mn + n^2 + p)^{d-1} n^2) = O(n^2 d(mn + n^2 + p)^{d+1})$.

Extending an observation table $T(t)$ with a new element in $S_{t+1}$
requires the addition of $\sum_{j=1}^{d-1} (2^j - 1) = 2^d - d - 1$ contexts
to $\sigma_\bullet(S_{t+1}) \setminus \sigma_\bullet(S_t)$, thus the
addition of at most $2^d - d$ new rows for the new elements of
$S_{t+1} \cup X(S_{t+1})$ in the observation table $T(t+1)$. This extension
requires at most
$(2^d - d) \cdot |E_t| \cdot (1 + |\sigma_\bullet(S_t)|) = O(n^2 d(2^d - d)(mn + n^2 + p)^{d-1})$
membership queries. A failed structural equivalence query adds at most $m$
elements to $S_{t+1} \setminus S_t$. As there will be at most $n$ failed
structural equivalence queries and at most $n(n+1)/2$ failed closedness
checks, the total number of elements added to the S-component is at most
$n(n+1)/2 + mn = O(mn + n^2)$. Thus the total time spent on inserting new
elements in the S-component of the observation table is
$O(n^2 d(2^d - d)(mn + n^2)(mn + n^2 + p)^{d-1})$.
Extending an observation table $T(t)$ with a new context in $E_{t+1}$
requires at most $|S_t \cup X(S_t)_{[\ell]}| = O((mn + n^2 + p)^d)$
membership queries. These additions are performed only by failed
consistency checks, and there are at most $n(n-1)/2$ of them. Thus, the
total time spent to insert new contexts in the E-component of the
observation table is $O(n^2 (mn + n^2 + p)^d)$. We conclude that the total
time spent to add elements to the components S and E of the observation
table is $O(n^2 d(2^d - d)(mn + n^2 + p)^d)$, which is polynomial in m, n
and p for a fixed maximum rank d.
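
The closed form for the number of new contexts can be verified directly. The
derivation below assumes that the sum ranges over $j = 1, \dots, d-1$ with
summand $2^j - 1$; starting the sum at $j = 0$ gives the same value, because
the extra term is $2^0 - 1 = 0$.

```latex
\begin{align*}
\sum_{j=1}^{d-1} \bigl(2^j - 1\bigr)
  &= \sum_{j=1}^{d-1} 2^j \;-\; (d-1) \\
  &= \bigl(2^d - 2\bigr) - (d - 1) \\
  &= 2^d - d - 1 .
\end{align*}
% Sanity check for d = 3: (2 - 1) + (4 - 1) = 4 = 2^3 - 3 - 1.
```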

The identification of the representative $r_t(s)$ for every $s \in S_t$
can be done by performing
$(|S_t|(|S_t| - 1)/2)|E_t| = O((mn + n^2)^2 n^2)$ comparisons.

Thus, all DFCAs A(T(t)) corresponding to consistent and closed obser- 
vation tables T(t) can be constructed in time polynomial in m and n. Since 
the algorithm encounters at most n consistent and closed observation tables, 
the total running time of the algorithm is polynomial in m and n. 


6 Conclusions and Acknowledgments 


We have presented an algorithm, called $LA^\ell$, for learning cover
context-free grammars from structural descriptions of languages of
interest. $LA^\ell$ is an adaptation of Sakakibara’s algorithm LA for
learning context-free grammars from structural descriptions, by following
a methodology similar to the design of Ipate’s algorithm $L^\ell$ as a
nontrivial adaptation of Angluin’s algorithm L*.
Like L*, our algorithm synthesizes a minimal deterministic cover automaton 
consistent with an observation table maintained via a learning protocol 
based on what is called in the literature a “minimally adequate teacher” [1]. 
And again, like algorithm L*, our algorithm is guaranteed to synthesize the 
desired automaton in time polynomial in n and m, where n is its number 
of states and m is the maximum size of a counterexample to a structural 
equivalence query. As the size of a minimal finite cover automaton is usually 
much smaller than that of a minimal automaton that accepts that language, 
the algorithm $LA^\ell$ is a better choice than algorithm LA for applications
where we are interested only in an accurate characterisation of the structural
descriptions with depth at most $\ell$.

This work has been supported by CNCS IDEI Grant PN-II-ID-PCE-2011-3-0981
“Structure and computational difficulty in combinatorial optimization: an
interdisciplinary approach.”